Message-ID: <20260524002758.GD27423@brightrain.aerifal.cx>
Date: Sat, 23 May 2026 20:27:59 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Re: Results of analysis of requirements for collation data representation

On Sat, May 23, 2026 at 06:34:30PM -0400, Rich Felker wrote:
> While the specifics of best representation of the collation weights
> data for runtime use, and how to generate such a representation from
> the CLDR root data and CLDR-format locale-specific tailorings, is
> something that will continue to evolve as the locale project
> continues, the actual requirements for the data representation are
> something that can be pinned down completely right now.
>
> This is part of the locale support overhaul project, funded by NLnet
> and the NGI Zero Core Fund.
>
>
>
> At a very high level, the collation data object is a greedy mapping
> from an input sequence of NFD codepoints, optionally contextual*, to
> sequences of collation elements.
>
> [...]

A small follow-up on non-matches and implicit rules:

Earlier messages on this topic addressed a potential need to
explicitly represent the mappings to implicit rules.

For modern radical-stroke collation order, implicit weights are not
used for ideographic characters. The radical tables generate a very
large (presently 87k) number of explicit mappings which do not admit
any simple elision. This leaves implicit weights applying only to
unassigned or noncharacter codepoints.

For traditional ideographic implicit-weights order, the weights can
either be generated explicitly at localedef time, as for
radical-stroke order, or we can have a table (either hardcoded in
libc, or as part of the locale) identifying which codepoint ranges the
special implicit weight rules apply to.

In any case, codepoints which are not matched by any rule in the
collation mapping get processed on the fly for implicit weight
assignment.

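As a minimal sketch of the range-table option, assuming a sorted table
of inclusive codepoint ranges: the names `cp_range`,
`ideographic_ranges` and `is_implicit_ideograph` are hypothetical, and
the three ranges listed are only an illustrative, incomplete subset of
the CJK Unified Ideographs blocks, not a proposal for the actual table
contents.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: inclusive codepoint range. */
struct cp_range { uint32_t first, last; };

/* Illustrative, incomplete table, sorted by first codepoint.
 * A real table would live in libc or in the locale data. */
static const struct cp_range ideographic_ranges[] = {
	{ 0x3400, 0x4DBF },   /* CJK Unified Ideographs Extension A */
	{ 0x4E00, 0x9FFF },   /* CJK Unified Ideographs */
	{ 0x20000, 0x2A6DF }, /* CJK Unified Ideographs Extension B */
};

/* Binary search: returns 1 if cp falls in one of the ranges, so that
 * the ideographic implicit weight rules would apply; 0 otherwise. */
static int is_implicit_ideograph(uint32_t cp)
{
	size_t lo = 0;
	size_t hi = sizeof ideographic_ranges / sizeof ideographic_ranges[0];
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		if (cp < ideographic_ranges[mid].first) hi = mid;
		else if (cp > ideographic_ranges[mid].last) lo = mid + 1;
		else return 1;
	}
	return 0;
}
```

A lookup like this would only run for codepoints that no rule in the
collation mapping matched, so its cost is off the common path.
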
This requires the collation weights data model to identify which
lead/prefix bytes are available for implicit weights to use.

For unassigned/non-ideographic codepoints, I think we only need a
single prefix, and the entire UTF-8 representation of the codepoint
can be the trailing part of the weight. That's not entirely optimal,
but optimizing the size of weights for unassigned codepoints seems
pointless. Maybe, since we already have a codepoint number, just
dumping it 7 bits at a time into the low bits of a 0x80 byte would be
a little denser and easier/faster. But either way the concept is the
same and this is easy.

For ideographic implicit weights, the data model should pin down the
particular assignment of weights. This is because the FractionalUCA
root data contains rules which define an explicit mapping in terms of
a potentially-implicit ideographic codepoint. If the implicit
assignment is left as a runtime implementation detail, the definitions
would not necessarily align at runtime.

None of this particularly informs or constrains the high-level or
implementation details of the data model we use. It's just a
consideration that requires some representation before this is over.
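
The 7-bits-at-a-time idea for unassigned codepoints could be sketched
roughly as follows. This is only an illustration under stated
assumptions: `IMPLICIT_PREFIX` is a hypothetical lead byte (the real
one would be whatever the data model reserves), and the sketch always
emits three 7-bit groups, most significant first, so that bytewise
comparison of the weights follows codepoint order.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical lead byte reserved for implicit weights; the actual
 * value would be identified by the collation data model. */
#define IMPLICIT_PREFIX 0xE0

/* Writes a 4-byte weight for an unmatched codepoint into out:
 * the prefix, then 21 bits of cp as three 7-bit groups, each stored
 * in the low bits of a byte with 0x80 set. Returns the number of
 * bytes written, or 0 if cp is outside the Unicode range. */
static size_t implicit_weight(uint32_t cp, unsigned char out[4])
{
	if (cp > 0x10FFFF) return 0;
	out[0] = IMPLICIT_PREFIX;
	out[1] = 0x80 | ((cp >> 14) & 0x7f);
	out[2] = 0x80 | ((cp >> 7) & 0x7f);
	out[3] = 0x80 | (cp & 0x7f);
	return 4;
}
```

Using a fixed number of groups costs a byte for small codepoints but
avoids any length-dependent ordering issues, which fits the point that
the size of these weights is not worth optimizing.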