Message-ID: <20250505181848.GD1827@brightrain.aerifal.cx>
Date: Mon, 5 May 2025 14:18:49 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Collation, IDN, and Unicode normalization

One aspect of LC_COLLATE support is that collation is supposed to be invariant under canonical equivalence, that is, across different normalization forms, while collation rules are best expressed in terms of NFD.

The most direct way to apply collation rules is to transform the input into NFD on the fly as the rules are applied. A more time-efficient and code-simplifying way is to apply a "canonical closure" (defined in UTN#5) to the collation rules ahead of time. The cost is larger collation tables (how much larger is something I still need to quantify), but without this approach, computing decompositions on the fly has its own table size cost, as well as the code and design work needed to make that table efficient.

Separately (and not part of the locale overhaul project), IDN support requires the capability to perform normalization into NFKC -- maybe not for all of Unicode, but at least for the characters that could appear in domain names. So in theory there is possibly some value in trying to share the [de]composition tables and use them in both directions.

I know that for a very old version of Unicode supported in my uuterm, decomposition tables and code fit in under 8k. I'm guessing the canonical closure for the collation data will be a lot larger than that, even if Hangul could be special-cased and elided. But depending on what level of collation capability we want internal to libc, independent of having a locale definition loaded (which would be fully-shared mmapped), this size might mainly be in locale files on disk, as opposed to decomposition tables, which would be linked into libc.
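For reference, the reason Hangul can be special-cased is that precomposed Hangul syllables decompose arithmetically, so they need no table entries at all. A minimal sketch in C, using the constants from the Unicode Standard's Hangul syllable decomposition algorithm; the function name `hangul_decompose` and its interface are hypothetical, not anything that exists in musl or uuterm:

```c
#include <stddef.h>
#include <stdint.h>

/* Constants from the Unicode Standard's Hangul syllable
 * decomposition algorithm. */
enum {
    S_BASE  = 0xAC00,            /* first precomposed syllable */
    L_BASE  = 0x1100,            /* first leading consonant */
    V_BASE  = 0x1161,            /* first vowel */
    T_BASE  = 0x11A7,            /* one before first trailing consonant */
    V_COUNT = 21,
    T_COUNT = 28,
    N_COUNT = V_COUNT * T_COUNT, /* 588 */
    S_COUNT = 19 * N_COUNT       /* 11172 */
};

/* Hypothetical helper: writes the canonical decomposition of a
 * precomposed Hangul syllable into out[] and returns its length
 * (2 or 3). Returns 0 if c is not a precomposed Hangul syllable,
 * in which case out[] is left untouched. */
static size_t hangul_decompose(uint32_t c, uint32_t out[3])
{
    if (c < S_BASE || c >= S_BASE + S_COUNT)
        return 0;

    uint32_t s = c - S_BASE;
    uint32_t t = s % T_COUNT;

    out[0] = L_BASE + s / N_COUNT;
    out[1] = V_BASE + (s % N_COUNT) / T_COUNT;
    if (t == 0)
        return 2;                /* no trailing consonant */
    out[2] = T_BASE + t;
    return 3;
}
```

With this, U+AC00 yields the two code points U+1100 U+1161, and U+D4DB yields U+1111 U+1171 U+11B6. Whether the same arithmetic trick carries over cleanly to the canonically closed collation data is a separate question that this sketch does not answer.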

I'll be trying to work out some quantitative data on the tradeoffs here, but wanted to go ahead and put the topic out there, especially since the IDN topic has come up on IRC again recently and coming up with a good choice here might intersect with IDN stuff.

Rich