Message-ID: <20250528064742.GS1827@brightrain.aerifal.cx>
Date: Wed, 28 May 2025 02:47:43 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Re: Collation, IDN, and Unicode normalization

On Mon, May 05, 2025 at 02:18:49PM -0400, Rich Felker wrote:
> One aspect of LC_COLLATE support is that collation is supposed to be
> invariant under canonical equivalence/different normalization forms,
> while collation rules are best expressed in terms of NFD.
>
> The most direct simple way to apply collation rules is to transform
> into NFD (on the fly) as they're applied. A more time-efficient and
> code-simplifying way is to apply a "canonical closure" (defined in
> UTN#5) to the collation rules ahead of time. The cost is making the
> collation tables larger (how much larger is something I still need to
> quantify), but without using this approach, there is a table size cost
> (as well as code and design for making this table efficient) to be
> able to compute decompositions on the fly.

It turns out a good deal of the decomposition/normalization logic is
needed even if you do the canonical closure approach, since you still
need to deal with non-starters (characters with canonical combining
class != 0) which may be reordered under normalization. For example,
given:

00C8 = 0045 0300 (E-grave)
1E1A = 0045 0330 (E-tilde_below)

00C8 0330 is equivalent to 1E1A 0300, and both must be treated as
0045 0330 0300 for the purpose of applying the collation rules
(because 0330 has lower canonical combining class than 0300).

One could imagine expanding out a closure for all possible sequences
that could reorder to a given collation element, but this quickly
blows up with all the non-meaningful-but-allowed sequences of
combining marks, and AFAICT the set isn't even finite. This means it
always requires some sort of fallback to an iterator that runs over a
sequence of a starter followed by a number of non-starters and reads
off the decomposed characters in order of canonical combining class,
without actually having working space to decode them. This will be an
important ingredient of the implementation, and it's a bit tricky to
do right, but tractable.

So my conclusion is that the "canonical closure" approach is really a
performance optimization trading speed for size, and doesn't offer
much or anything in the way of simplicity. I think this means we can
and should forget about it.
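To make the iterator idea above concrete, here is a rough sketch. It
is illustrative only, not actual musl code: canon_decomp() and
combining_class() are placeholder names for lookups that would be
backed by the Unicode data tables, run segmentation and Hangul are
glossed over, and per-character decompositions are assumed to fit in
a 4-element buffer. Each call rescans the run and picks the
not-yet-emitted decomposed code point with the smallest
(combining class, position) key, which is exactly a stable sort by
canonical combining class, done in O(1) space at O(n^2) time per run:

#include <stddef.h>
#include <stdint.h>

/* Hypothetical table-backed helpers, not existing interfaces:
 * canon_decomp() writes the full canonical decomposition of c and
 * returns its length; combining_class() returns the canonical
 * combining class. */
int canon_decomp(uint32_t c, uint32_t out[4]);
unsigned combining_class(uint32_t c);

struct reorder_iter {
	const uint32_t *src; /* one starter plus following non-starters */
	size_t len;          /* number of source characters in the run */
	unsigned last_ccc;   /* key of the last code point emitted */
	size_t last_pos;
	int started;
};

/* Emit the next code point of the run in canonically ordered form;
 * returns 0 when the run is exhausted. */
static uint32_t reorder_next(struct reorder_iter *it)
{
	uint32_t d[4], best = 0;
	unsigned best_ccc = 0;
	size_t best_pos = 0, pos = 0;
	int found = 0;
	for (size_t i = 0; i < it->len; i++) {
		int n = canon_decomp(it->src[i], d);
		for (int j = 0; j < n; j++, pos++) {
			unsigned cc = combining_class(d[j]);
			/* skip anything at or before the last emitted key */
			if (it->started && (cc < it->last_ccc ||
			    (cc == it->last_ccc && pos <= it->last_pos)))
				continue;
			/* pos only increases, so ties on cc keep the
			 * earliest candidate, preserving stability */
			if (!found || cc < best_ccc) {
				best = d[j];
				best_ccc = cc;
				best_pos = pos;
				found = 1;
			}
		}
	}
	if (!found) return 0;
	it->last_ccc = best_ccc;
	it->last_pos = best_pos;
	it->started = 1;
	return best;
}

Usage would be along the lines of pointing src/len at one starter
plus its trailing combining marks and calling reorder_next() until it
returns 0; on the 00C8 0330 example above it reads off 0045, 0330,
0300 in that order.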
> Separately (and not part of the locale overhaul project), IDN support
> requires the capability to perform normalization into NFKC -- maybe
> not for all of Unicode, but at least for the characters that could
> appear in domain names. So in theory there is possibly some value to
> trying to share the [de]composition tables and use them in both
> directions.

At least the above iterator can very much be shared here.

> I know for a very old version of Unicode supported in my uuterm,
> decomposition tables and code fit in under 8k.
>
> I'm guessing the canonical closure for the collation data will be
> a lot larger than that, even if Hangul could be special-cased and
> elided. But depending on what level of collation capability we want
> internal to libc, independent of having a locale definition loaded
> (which would be fully-shared mmapped), this size might mainly be in
> locale files on disk, as opposed to decomposition tables which would
> be linked into libc.
>
> I'll be trying to work out some quantitative data on the tradeoffs
> here, but wanted to go ahead and put the topic out there, especially
> since the IDN topic has come up on IRC again recently and coming up
> with a good choice here might intersect with IDN stuff.

The Default Unicode Collation Element Table (DUCET) contains nearly
40k explicit mappings, on top of implicit ones. Under 5000 of those
are mappings for precomposed characters.

For the secondary and tertiary weights, there are patterns that
should collapse them down a lot, maybe to as little as a fraction of
a bit per entry. But the primary weights are generally not assigned
systematically, at least not for most of the real-world,
living-language scripts in the BMP, making the size at least 16 bits
each for some 16k characters, and probably close to that much for the
remaining ~24k. That comes to likely 70-100k of data, which is
probably way above what we'd want to bake into libc. (For example, a
static-linked ls command would have it all due to sorting by
strcoll/strxfrm.)

So I think this points to having no baked-in default collation table
(e.g. for locales to delegate to, or for some variant of the C.UTF-8
locale to use) as the right thing to do. In a way this is good
policy: it means everything is up to the locale to specify the
collation order, incorporating the Unicode defaults (DUCET) or not,
rather than artificially favoring locales that build on it or
requiring them to do so.

So, next steps:

1. I know we need the above-described iterator, and it's something
concrete that can be written and tested independently of anything
else. So I'm going to begin working on it, and post a standalone
proof-of-concept once I have something working.

2. The output of that iterator can serve as the input to a prefix
tree, which would be the data structure for the collation table. This
needs some further checks that it works, but I think the
well-formedness conditions for collation element tables (UTS#10
section 5) ensure longest-matched-prefix is a valid way to select the
rule to apply. (A rough sketch of such a lookup is below.)
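To illustrate what step 2 might look like, here is a minimal sketch
of longest-matched-prefix selection over the iterator's output.
Everything here is hypothetical: struct colnode and struct colelem
are placeholders, and the real table would presumably be a flat,
mmappable format rather than pointer-chasing nodes. It just shows the
"remember the last node that carried a rule" walk:

#include <stddef.h>
#include <stdint.h>

struct colelem; /* stand-in for whatever holds the collation weights */

struct colnode {
	const struct colelem *elem;   /* weights if a rule ends here, else 0 */
	size_t nchild;
	const uint32_t *keys;         /* child code points, sorted */
	const struct colnode *const *child;
};

static const struct colnode *col_child(const struct colnode *n, uint32_t c)
{
	/* binary search over the sorted child keys */
	size_t lo = 0, hi = n->nchild;
	while (lo < hi) {
		size_t mid = lo + (hi-lo)/2;
		if (n->keys[mid] < c) lo = mid+1;
		else if (n->keys[mid] > c) hi = mid;
		else return n->child[mid];
	}
	return 0;
}

/* Walk the tree over the canonically ordered code points cp[0..len),
 * as produced by the reordering iterator, and return how many of them
 * the longest matching rule consumes, with its weights in *elem.
 * The UTS#10 section 5 well-formedness conditions are what should
 * make "longest matched prefix wins" a valid selection strategy. */
static size_t col_lookup(const struct colnode *root, const uint32_t *cp,
	size_t len, const struct colelem **elem)
{
	const struct colnode *n = root;
	size_t matched = 0;
	*elem = 0;
	for (size_t i = 0; i < len; i++) {
		n = col_child(n, cp[i]);
		if (!n) break;
		if (n->elem) {
			*elem = n->elem;
			matched = i+1;
		}
	}
	return matched;
}

The caller would then resume matching at cp+matched, falling back to
implicit weights when matched comes back as 0.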