Message-ID: <20250529023741.GW1827@brightrain.aerifal.cx>
Date: Wed, 28 May 2025 22:37:41 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Re: Collation, IDN, and Unicode normalization

On Wed, May 28, 2025 at 08:29:53PM -0400, Rich Felker wrote:
> So, what I think may make sense is combining them into a single table
> that maps "character needing normalization processing" to a sequence
> (possibly length-1, if it's a non-starter rather than an actual
> decomposable character) of *pairs* of the form [character, canonical
> combining class]. That way, you get the canonical combining classes
> you need at the same time as doing the decomposition, rather than
> needing a separate step to look one up for each character that comes
> out of the decomposition.
>
> The cost of storing the extra bits for the combining class is
> virtually nothing, since (1) it saves having a separate parallel data
> structure, and (2) Unicode characters are 21-bit (17-bit in practice
> for the relevant range), leaving well over 8 bits that are just
> wasted unless you use costly-to-unpack packings.
>
> Basically, the motivation is this: the bits are there unused, so we
> might as well use them to store the data we're going to need right
> afterwards anyway, and in that case, it makes sense to use the same
> representation for non-starters.
>
> I'm not sure what the ideal format for the actual table is. I was
> initially looking at a multi-level bit table to evaluate the
> predicate "needs normalization processing" (same as the isw*()
> tables in musl now), then binary-searching a sorted list. However, a
> direct multi-level table mapping to the expansions might be an
> option as well. There are currently 78 256-character-aligned blocks
> that have relevant characters in them. That puts a bit table at
> about 2.5k and a full map at about 20k, versus something like 7-10k
> (depending on how nasty the encoding is) for a binary-searchable
> list of 1980 slots, each holding a 17-bit character number and an
> offset to the expansion.

Some more numbers:

For just the real decompositions, not including non-starters, there
are only 31 blocks (~8k of multi-level table) rather than 78. That
makes it sound appealing not to combine them. However, that leaves 68
blocks with non-starters in them, which still have to be processed
some other way. I'm pretty sure there are fewer than 16 combining
classes per block, so we could use a 4-bit table that would get it
down to about 9k. But that's still nearly as large as the combined
table (roughly 17k vs 20k).

Alternatively, we could use a binary search. At 934 entries, that's
10 steps, which is rather costly. One nice thing is that we can
easily pre-filter for which characters might be non-starters using
wcwidth plus a few exceptions, but that still leaves processing text
containing any of these combining marks inordinately expensive.

(Aside: all of the exceptions have general category Mc, some of them
changed from Mn in previous versions of Unicode, and it's probably an
error that our current policy for wcwidth does not classify them as
nonspacing. This needs to be reviewed at some point.)

One hybrid option that might sound like a reasonable compromise is
using the block number (codepoint>>8) to index a list of binary
search endpoints, so that the list to search is small. However, the
common combining characters are all grouped together in blocks 03,
1D, 2D, and a few others, so performance for these would stay pretty
bad (roughly a 6-step search).
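
To make the combined representation concrete, here's a minimal sketch
of one way a 4-byte slot could pack a [character, combining class]
pair. The exact bit layout, the last-pair flag, and all names are
illustrative assumptions, not a committed format:

#include <stdint.h>

/* Assumed 4-byte slot: 21 bits of codepoint, 8 bits of canonical
 * combining class, and one spare bit flagging the last pair of a
 * sequence. */
#define SLOT_CP(s)   ((s) & 0x1fffff)
#define SLOT_CCC(s)  ((s)>>21 & 0xff)
#define SLOT_LAST(s) ((s)>>29 & 1)

static uint32_t make_slot(uint32_t cp, unsigned ccc, int last)
{
	return cp | (uint32_t)ccc<<21 | (uint32_t)last<<29;
}

A non-starter is then just a length-1 sequence (make_slot(cp,ccc,1)),
and the decomposition loop reads pairs until the last flag is set,
with no separate combining-class lookup step.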
Overall, none of these options sounds like a significant space win
over the fast option. OK, so let's see how much space that really
takes:

Top-level table (indexed by codepoint>>8) to select a second-level
table: 1 byte per entry, for 512 bytes.

Second-level tables (indexed by codepoint&255): each one needs a few
header bytes to indicate where its data in the expansion table
starts, plus one byte per entry that's either an offset into the
expansion table or 0 for "no expansion". Essentially this is
something like 260 bytes per block, times 78 blocks, for 20280 bytes
total.

Expansion table slots (indexed via the second-level tables): each
slot is 4 bytes. There are 934 single-slot entries (non-starters),
for 3736 bytes, plus 1046 decomposition entries of 2-4 slots each: 36
are 4-slot, 231 are 3-slot, and 779 are 2-slot. That's 2395 slots,
for 9580 bytes.

Total across all of these is 34108 bytes.

One significant optimization I think I could make, without a cost in
performance or complexity, is using the second-level table header
bytes to encode the subrange of the block that actually has relevant
codepoints in it, so that the rest can be elided. I don't have the
numbers yet, but I think that would cut at least 50% off the
second-level table size, maybe more. Some blocks have only a handful
of (usually consecutive or close) codepoints in them, and almost none
uses more than half the range.

This does mean second-level tables no longer start at
multiple-of-table-size offsets (since they're variable-size), so the
top-level table has to be widened to 16 bits per entry so it can
store offsets rather than indices. But that's really no big deal; it
becomes 1k instead of 512 bytes.

With this change, hopefully the data can come in under 20k total.

To be clear, this is size that goes into libc.so/libc.a but doesn't
appear in static-linked programs unless they call strcoll or strxfrm.
Eventually it may also appear, for IDN, in static-linked programs
that use the stub resolver.

I'll try to work out a draft of the table gen code in the next few
days.
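
P.S. As a sketch of what the lookup side of this could look like: the
header layout here (lo/hi covered subrange, 16-bit little-endian base
slot index), the 0xffff "empty block" sentinel, and all names are
assumptions for illustration, not the final encoding.

#include <stdint.h>

/* Assumed encoding: toplevel[cp>>8] is a byte offset into blocks[]
 * (0xffff = block has no relevant characters). Each block begins
 * with lo, hi (the covered subrange within the block) and a 16-bit
 * base index into slots[], followed by one byte per codepoint in
 * [lo,hi]: 0 means "nothing here", n means the sequence begins at
 * slots[base+n-1]. */
extern const uint16_t toplevel[512];
extern const unsigned char blocks[];
extern const uint32_t slots[];

static const uint32_t *lookup(uint32_t cp)
{
	const unsigned char *b;
	unsigned c, n;
	if (cp >= 0x20000 || toplevel[cp>>8] == 0xffff) return 0;
	b = blocks + toplevel[cp>>8];
	c = cp & 255;
	if (c < b[0] || c > b[1]) return 0;
	n = b[4 + c - b[0]];
	if (!n) return 0;
	return slots + (b[2] | b[3]<<8) + n - 1;
}

The caller then consumes pairs from the returned pointer until the
last-pair flag is set.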