Message-ID: <20250531184919.GC1827@brightrain.aerifal.cx>
Date: Sat, 31 May 2025 14:49:19 -0400
From: Rich Felker <dalias@...c.org>
To: Nick Wellnhofer <wellnhofer@...um.de>
Cc: musl@...ts.openwall.com
Subject: Re: Collation, IDN, and Unicode normalization

On Thu, May 29, 2025 at 12:52:49PM -0400, Rich Felker wrote:
> On Thu, May 29, 2025 at 10:03:57AM -0400, Rich Felker wrote:
> > down to. If you take out the actual expansions, which are 9580 bytes,
> > it's a better comparison. Those could be packed tighter to get it down
> > further, but improving them is independent of the way we map to them.
>
> OK, those can be cut down a lot.
>
> The 9580 figure was from 2395 slots, each 32-bit, each holding a 17-bit
> codepoint and an 8-bit combining class.
>
> However, a quick analysis shows there are only 94 characters that ever
> appear in a non-initial position of an expansion. This means the
> actual expansions can be a series of a potentially larger initial
> member followed by 7 bits (plus 1 continuation-flag bit?) for each
> subsequent character in the expansion.
>
> That requires a 94-entry table to map the possibilities to actual
> character numbers and combining classes, but the table can easily be
> 24-bit packed rather than 32-bit if we want, at 282 bytes, or just
> aligned 32-bit at 376 bytes.
>
> There are 317 possible characters that appear in the first slot of an
> expansion, and some of these overlap, because there are only 406 total
> that appear in any position.
>
> We could put all of those 406 in a table, with the 94 at the beginning
> so they can be referenced by a single byte.
>
> For non-starters which expand to themselves, we can have an additional
> 55 slots, one for each possible combining class, with a dummy
> character value meaning "expands to itself" and only the combining
> class value being used. That increases the number of entries to 461.
>
> That's 1383 or 1844 bytes depending on whether we pack to 3-byte
> entries.
>
> Now, what about the reference for the initial character of each
> expansion? Do we make them 2-byte? How about instead making all of
> the references a variable-length code: a byte x<128 represents itself;
> a byte x>=128 represents (x-128)*128+y where y is the next byte. This
> future-proofs things, allowing the number of characters that could
> appear to be as high as 16384, and lets us use a single byte for a lot
> of the leading characters of expansions too.
>
> Or, since we're not going to need anywhere near that many, we could
> reserve only a few bits for the variable-length coding. For example,
> if x<240 represents itself, all of the 94 non-initial decomposition
> characters and all of the 55 non-starter class codes fit in a single
> byte, and we can still represent everything else in the range of
> (x-240)*240+y, which reaches up to 3839 (over 8x the presently needed
> range).
>
> Now, let's estimate sizes again:
>
> If there are 2395 slots, 1046 of them initial, let's assume the
> initial ones take 2 bytes each (some can be less) in the expansion
> table and the rest take 1 byte each. That's 1349 + 2*1046 = 3441 bytes
> for expansions. Plus the above 1383 bytes for the table of possible
> characters, that gives 4824 bytes. Basically half of what we started at.
>
> One final note: I've kinda handwaved being able to pack the table of
> possible characters into 3 bytes each rather than 4, since they're
> 17-bit characters with 8-bit combining classes. However, in the
> non-BMP characters, only 9 of the possible 55 classes ever appear, so
> we can just steal 9 of the unassigned class slots to use as "this
> means one of those 9, plus set the high bit on the character value".
> However, given that the difference in table size vs using a 32-bit
> unpacked table is only 461 bytes, and that the unpacking code probably
> costs nearly that much in .text size, it's probably better to just
> forget that and use the 4-byte version at 1844 bytes. This makes the
> total size 5285 bytes, which is still incredibly small, and it also
> avoids implementing a packing we might have to rip out later if
> Unicode expands character assignment in ways that don't fit.
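For concreteness, unpacking one such 3-byte entry would look roughly
like the following. This is only a sketch: the placement of the 9
stolen class codes at the top of the class byte, the byte order, and
all names here are illustrative, and the actual class values would
come from the UCD.

#include <stdint.h>

/* the 9 combining classes that occur on non-BMP characters; values
   would be filled in from UnicodeData.txt */
static const uint8_t nonbmp_ccc[9] = { 0 };

struct packed_char { uint32_t cp; uint8_t ccc; };

static struct packed_char unpack3(const unsigned char *e)
{
	/* assumed layout: 16-bit character (low byte first) plus a class
	   byte, with class codes 247..255 stolen to mean "non-BMP" */
	struct packed_char d = { .cp = e[0] | e[1]<<8, .ccc = e[2] };
	if (d.ccc >= 247) {
		d.cp |= 0x10000;               /* set bit 16 of the codepoint */
		d.ccc = nonbmp_ccc[d.ccc-247];
	}
	return d;
}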
A small setback that, without other changes, makes throwing away that
packing rather costly: while there are only a small number of
characters that appear in multi-character decompositions, I failed to
include single-character decompositions ("singletons" in Unicode
normalization terminology). This brings the number of characters that
appear in decompositions (together with the nonstarter self-map
entries) up to 1375.

Most of these are characters only present for round-tripping legacy
character sets, such as Å vs "ANGSTROM SIGN", Ω vs "OHM SIGN", and of
course all the "CJK COMPATIBILITY IDEOGRAPH" codepoints.

For what it's worth, there are 3 odd compat characters:

0340;COMBINING GRAVE TONE MARK;Mn;230;NSM;0300;;;;N;NON-SPACING GRAVE TONE MARK;;;;
0341;COMBINING ACUTE TONE MARK;Mn;230;NSM;0301;;;;N;NON-SPACING ACUTE TONE MARK;;;;
0343;COMBINING GREEK KORONIS;Mn;230;NSM;0313;;;;N;;;;;

that decompose respectively to nonstarters:

0300;COMBINING GRAVE ACCENT;Mn;230;NSM;;;;;N;NON-SPACING GRAVE;;;;
0301;COMBINING ACUTE ACCENT;Mn;230;NSM;;;;;N;NON-SPACING ACUTE;;;;
0313;COMBINING COMMA ABOVE;Mn;230;NSM;;;;;N;NON-SPACING COMMA ABOVE;;;;

All the rest of the singletons decompose to starters (and so don't
need a combining class stored).

Since the above 3 seem to have basically just been a historical
mistake, and the bulk of new singletons are CJK COMPAT, I think it
makes sense to optimize around the assumption that most future-added
singletons will be starters (not needing a combining class).

So... it's really nasty to spend 4 or even 3 bytes per entry on a
"character that appears in decompositions" for a "character that
appears once, in a single decomposition, because it was accidentally
encoded twice in some obscure legacy character set". That's on top of
needing an entry in, and a 2-byte reference out of, the expansions
table, not to mention increasing the number and size of populated
second-level tables.

One possibility is to allow coding characters directly in the
expansion table rather than as references to a commonly appearing
character. The VLC so far is really only defined as:

0..239:   single-byte index
240..255: head of a double-byte index

But since that whole range isn't needed, it could be expanded to:

0..239:   single-byte index
240..251: head of a double-byte index
252..255: top 2 bits of an immediate 2-byte character that follows

This would allow representing any character up to U+3FFFF (note: some
in the 2xxxx range appear as CJK COMPAT mappings), as long as it's a
starter, without needing an entry in the list of characters that
appear in decompositions, at only 3 bytes (vs 2) in the expansion
table. That saves 3 (or 2, if we could pack the latter down to 3
total) bytes per singleton. It might help reduce the size of some
existing non-singleton decompositions too.
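Roughly, decoding one element of an expansion under the expanded
coding might look like this (again, just a sketch: the table name, the
byte order of the immediate form, and the exact use of the lead-byte
ranges are illustrative):

#include <stdint.h>

struct decomp_char { uint32_t cp; uint8_t ccc; };

/* hypothetical table of characters (with combining classes) that
   appear often enough in decompositions to merit an index */
extern const struct decomp_char decomp_chars[];

static struct decomp_char decode_elem(const unsigned char **p)
{
	const unsigned char *s = *p;
	struct decomp_char d;
	unsigned x = s[0];
	if (x < 240) {
		/* 0..239: single-byte index into the table */
		d = decomp_chars[x];
		*p = s+1;
	} else if (x < 252) {
		/* 240..251: head of a double-byte index */
		d = decomp_chars[(x-240)*240 + s[1]];
		*p = s+2;
	} else {
		/* 252..255: top 2 bits of an immediate character; the next
		   2 bytes supply the low 16 bits (reaches U+3FFFF) */
		d.cp = (uint32_t)(x-252)<<16 | s[1]<<8 | s[2];
		d.ccc = 0;  /* inline form is only used for starters */
		*p = s+3;
	}
	return d;
}

As noted below, the double-byte-index case may turn out not to be
needed at all.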
Indeed, only 135 characters appear 3 or more times in decompositions.
Including singletons, 1054 appear only once; excluding singletons,
still 231 appear only once. (These cost more to put in a table than to
inline in the expansions.)

If we only include in the table characters which either appear at
least 3 times or are non-starters, double-byte indices wouldn't get
used at all. Each expansion element would either be a single-byte
index or an inline character value.

If we don't actually need double-byte indices, we could in theory
recover the values used for them to encode inline characters more
efficiently. But I don't think it really helps that much. The places
where individual characters aren't common in decompositions, but
where characters with matching upper bits are common, are fairly
sporadic, spread over the 00, 04, 05, 0F, 22, and 30 blocks and the
entire CJK ideographs range. In theory we could pick, say, the 16 most
commonly appearing blocks and encode those in the lead byte, but it
would save a few hundred bytes total at most, and it's gratuitously
ugly and dependent on either manually picking the best blocks or
marshalling results of the optimization logic into code. I think this
is a bad direction to pursue.
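For reference, the inclusion rule described above, in generator terms
(purely illustrative; assumes per-character occurrence counts and
combining classes have already been tallied from UnicodeData.txt):

/* Illustrative generator-side rule: a character gets a slot in the
   index table only if it appears at least 3 times in decompositions
   or is a non-starter (its combining class must be stored anyway);
   everything else would be coded inline in the expansions. */
static int include_in_table(unsigned count, unsigned char ccc)
{
	return count >= 3 || ccc != 0;
}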