![]() |
|
Message-ID: <20250612042036.GR1827@brightrain.aerifal.cx> Date: Thu, 12 Jun 2025 00:20:37 -0400 From: Rich Felker <dalias@...c.org> To: musl@...ts.openwall.com Subject: Collation table data format, round 1 Following up with research towards representing the collation data for locales: In general, it's likely that we want to be able to derive as much of the locale data as possible from Unicode CLDR data. (CLDR = Common Locale Data Repository) Today, I found that the CLDR approach to collation both builds upon and deviates some from the baseline UCA specs. The relevant document is UTS#35, Part 5, Collation: https://unicode.org/reports/tr35/tr35-collation.html In particular, it offers both a structured mechanism to define locale-specific tailorings and reorderings in a reasonable form for locale implementors, and (this is the part that looks most exciting from my side) provides an alternate framework for the root collation data designed around compact variable-length byte-sequence-based weights. This is in contrast to the base UCA's DUCET table, which defines fixed-width 16-bit weights at each of the 3+ levels -- 48+ bits of mostly-redundant data per characer, which need further massaging to work with strxfrm since its units are bytes not 16-bit code units, and embedded zero bytes are not possible. It looks like this development was the outcome of folks figuring out that the UCS-2 mindset was a mistake. So, I think this gives a strong, well-precedented basis for a data format to use. Assuming we go with that, the on-disk/mapped format should look something like this: a multi-level table (probably with level breakdown defined in the data header, so that we don't hard-code something that ends up not working well for particular locale tailorings or future unicode directions) that maps: - whole blocks subject to an implicit rule to the implicit rule - characters that don't participate in a "contraction" (many-to-many map) to a set of byte sequences for each of the 3 (or n?) levels. probably represented as a single byte sequence with some efficient encoding delimiting the levels. - characters that do participate in "contraction" to a next-level table in a sort of prefix-tree. Note that the default weights (in UCS DUCET; I haven't reevalutated this for the CLDR form but it should be the same) have very few characters participating as the start of contractions. I count 77, most of which only have 1-3 possibilities that can follow, but some (especially Thai, which has vowel mark reordering; I think the other cases are similar) have 45-48 possible [sequences of] characters that follow. Locale-specific tailorings may have a good deal more. Normally they're isolated to the particular language(s) in that locale, but I would like to explore building more multilingual-friendly collation tables that cover "everything that doesn't conflict with your language" rather than "only what's needed for your language". This is completely outside the domain of code for the project, more a matter of what recipes we end up using for the official locales. But it may inform choices about what's costly in terms of on-disk locale size. Rich
Powered by blists - more mailing lists
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.