![]() |
|
Message-ID: <20250529165249.GB1827@brightrain.aerifal.cx> Date: Thu, 29 May 2025 12:52:49 -0400 From: Rich Felker <dalias@...c.org> To: Nick Wellnhofer <wellnhofer@...um.de> Cc: musl@...ts.openwall.com Subject: Re: Collation, IDN, and Unicode normalization On Thu, May 29, 2025 at 10:03:57AM -0400, Rich Felker wrote: > down to. If you take out the actual expansions, which are 9580 bytes, > it's a better comparison. Those could be packed tighter to get it down > further, but improving them is independent of the way we map to them. OK, those can be cut down a lot. The 9580 figure was from 2395 slots each 32-bit, each holding a 17-bit codepoint and 8-bit combining class. However, a quick analysis shows there are only 94 characters that ever appear as the non-initial position of an expansion. This means the actual expansions can be a series of a potentially larger initial member followed by 7 bits (plus 1 continuation-flag bit?) for each subsequent character in the expansion. That requires a 94-entry table to map the possibilities to actual character numbers and combining classes, but the table can easily be 24-bit packed rather than 32-bit if we want, as 282 bytes, or just aligned 32-bit as 376 bytes. There are 317 possible characters that appear in the first slot of an expansion, and some of these overlap, because there are only 406 total that appear in any position. We could put all of those 406 in a table, with the 94 at the beginning so they can be referenced by a single byte. For non-starters which expand to themselves, we can have an additional 55 slots, one for each possible combining class, with a dummy character value meaning "expands to itself" and only the combining class value being used. That increases the number of entries to 461. That's 1383 or 1844 bytes depending on whether we pack to 3-byte entries. Now, what about the reference for the initial character of each expansion? Do we make them 2-byte? How about instead, making all of the references a variable-length code. A byte x<128 represents itself; a byte x>=128 represents (x-128)*128+y where y is the next byte. This future-proofs things allowing the number of characters that could appear to be as high at 16384, and lets us use a single-byte for a lot of the leading characters of expansions too. Or, since we're not going to need anywhere near that many, we could reserve only a few bits for the variable-length codind. For example, if x<240 represents itself, all of the 94 non-initial decomposition characters and all of the 55 non-starter class codes fit in a single byte, and we can still represent everything else in the range of (x-240)*240+y which reaches up to 3839 (over 8x the presently needed range). Now, let's estimate sizes again: If there are 2395 slots, 1046 of them initial, let's assume the initial ones take 2 bytes each (some can be less) in the expansion table and the rest take 1 byte each. That's 1349 + 2*1046 = 3441 bytes for expansions. Plus the above 1383 bytes for the table of possible character, gives 4824 bytes. Basically half of what we started at. One final note: I've kinda handwaved being able to pack the table of possible characters into 3 bytes each rather than 4, since they're 17-bit characters with 8-bit combining classes. However, in the non-BMP characters, only 9 of the possible 55 classes ever appear, so we can just steal 9 of the unassigned slots to use as "this means one of those 9, plus set the high bit on the character value". However, given that the difference in table size vs using a 32-bit unpacked table is only 461 bytes, and that the unpacking code probably costs nearly that much in .text size, it's probably better to just forget that and use the 4-byte version at 1844 bytes. This makes the total size 5285 bytes, which is still incredibly small, and it also avoids implementing a packing we might have to rip out later if Unicode expands character assignment in ways that don't fit.
Powered by blists - more mailing lists
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.