Message-ID: <20250529002951.GU1827@brightrain.aerifal.cx>
Date: Wed, 28 May 2025 20:29:53 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Re: Collation, IDN, and Unicode normalization

On Wed, May 28, 2025 at 02:47:43AM -0400, Rich Felker wrote:
> On Mon, May 05, 2025 at 02:18:49PM -0400, Rich Felker wrote:
> > One aspect of LC_COLLATE support is that collation is supposed to be
> > invariant under canonical equivalence/different normalization forms,
> > while collation rules are best expressed in terms of NFD.
> >
> > The most direct simple way to apply collation rules is to transform
> > into NFD (on the fly) as they're applied. A more time-efficient and
> > code-simplifying way is to apply a "canonical closure" (defined in
> > UTN#5) to the collation rules ahead of time. The cost is making the
> > collation tables larger (how much larger is something I still need
> > to quantify), but without using this approach, there is a table size
> > cost (as well as code and design for making this table efficient) to
> > be able to compute decompositions on the fly.
>
> It turns out a good deal of the decomposition/normalization logic is
> needed even if you do the canonical closure approach, since you still
> need to deal with non-starters (characters with canonical combining
> class != 0) which may be reordered under normalization. For example,
> given:
>
> 00C8 = 0045 0300 (E-grave)
> 1E1A = 0045 0330 (E-tilde_below)
>
> 00C8 0330 is equivalent to 1E1A 0300, and both must be treated as
> 0045 0330 0300 for the purpose of applying the collation rules
> (because 0330 has lower canonical combining class than 0300).
>
> One could imagine expanding out a closure for all possible sequences
> that could reorder to a given collation element, but this quickly
> blows up with all the non-meaningful-but-allowed sequences of
> combining marks, and AFAICT the set isn't even finite. This means it
> always requires some sort of fallback to an iterator that runs over a
> sequence of a starter followed by a number of non-starters and reads
> off the decomposed characters in order of canonical combining class,
> without actually having working space to decode them. This will be an
> important ingredient of the implementation, and it's a bit tricky to
> do right, but tractable.

OK, so I've worked out the iterator system and it's not all that bad. I
have untested code, but it depends on a backend for the decomposition &
canonical combining class tables, which I haven't done yet.

In regard to that, I have a rather unconventional idea for combining
data tables that I'd like to put forward.

First, some numbers. There are 1046 characters in current Unicode that
decompose, most of them just to a pair of characters, but a handful to
3 or 4. (This is not counting Hangul Jamo, which decompose
algorithmically and don't need any tables.)

There are 934 characters with nonzero canonical combining class. These
are called "non-starters". Each has an 8-bit combining class code
associated with it. Of the 255 possible nonzero values, only 55
actually appear.

The (pardon the pun) canonical way to represent these would be to have
a multilevel table "is_decomposing" to identify characters that
decompose, and a sorted list of decompositions to binary-search when
one is hit, plus a similar multilevel table for combining classes, but
mapping to the class number much as the case-mapping table maps to a
smallish set of case-mapping rules.
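For reference, the binary-search half of that conventional approach
would look roughly like the following. This is just an illustrative
sketch with hypothetical table names and layout, not actual musl code;
a separate combining-class lookup would still be needed for each
character it returns:

#include <stddef.h>
#include <stdint.h>

extern const uint32_t decomp_key[]; /* sorted decomposing codepoints */
extern const uint16_t decomp_off[]; /* offsets into decomp_seq, plus one sentinel */
extern const uint32_t decomp_seq[]; /* flat array of decomposed sequences */
extern const size_t decomp_cnt;

/* Return the decomposition of c, or 0 if c does not decompose. */
static const uint32_t *lookup_decomp(uint32_t c, size_t *len)
{
	size_t lo = 0, hi = decomp_cnt;
	while (lo < hi) {
		size_t m = lo + (hi-lo)/2;
		if (decomp_key[m] < c) lo = m+1;
		else hi = m;
	}
	if (lo == decomp_cnt || decomp_key[lo] != c) return 0;
	*len = decomp_off[lo+1] - decomp_off[lo];
	return decomp_seq + decomp_off[lo];
}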
However, while decomposable characters and non-starters are separate
things, for practical purposes they go hand-in-hand. They are the
characters that need special treatment when normalizing, and you have
to process both without knowing ahead of time which you're looking at.

So, what I think may make sense is combining them into a single table
that maps "character needing normalization processing" to a sequence
(possibly length-1, if it's a non-starter rather than an actual
decomposable character) of *pairs* of the form [character, canonical
combining class]. That way, you get out the canonical combining classes
you need at the same time as doing the decomposition, rather than
needing a separate step to look them up for each character that comes
out of the decomposition.

The cost of storing the extra bits for the combining class is virtually
nothing, since (1) it saves having a separate parallel data structure,
and (2) Unicode characters are 21-bit (17-bit in practice for the
relevant range), leaving well over 8 bits that are just wasted unless
you do costly-to-unpack packings. Basically, the motivation is this:
the bits are there unused, so we might as well use them to store the
data we're going to need right afterwards anyway, and in that case, it
makes sense to use the same representation for non-starters.

I'm not sure what the ideal format for the actual table is. I was
initially looking at a multi-level bit table to evaluate the predicate
"needs normalization processing" (same as the isw*() tables in musl
now), then binary-searching a sorted list. However, a direct
multi-level table mapping to the expansions might be an option as well.
There are currently 78 256-character-aligned blocks that have relevant
characters in them. This puts a bit table at about 2.5k, and a full map
at about 20k, vs something like 7-10k (depending on how nasty the
encoding is) for the binary-searchable list of 1980 slots, each holding
a 17-bit character number and an offset to the expansion.
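To make the combined-entry idea concrete, here's one possible packing,
assuming a single 32-bit word per [character, combining class] pair
(low 21 bits for the codepoint, next 8 for the class). The names and
exact layout are just for illustration, not a committed format:

#include <stdint.h>

#define NORM_CHAR(x)     ((x) & 0x1fffff)
#define NORM_CCC(x)      (((x) >> 21) & 0xff)
#define NORM_PAIR(c,ccc) ((uint32_t)(c) | (uint32_t)(ccc) << 21)

/* U+00C8 decomposes to U+0045 (ccc 0) followed by U+0300 (ccc 230);
 * a plain non-starter like U+0330 (ccc 220) gets a length-1 entry. */
static const uint32_t entry_00C8[] = {
	NORM_PAIR(0x0045, 0),
	NORM_PAIR(0x0300, 230),
};
static const uint32_t entry_0330[] = {
	NORM_PAIR(0x0330, 220),
};

With something like that, the iterator can emit (character, class)
pairs straight out of the table, with no second lookup per emitted
character.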