Message-ID: <20250531184919.GC1827@brightrain.aerifal.cx>
Date: Sat, 31 May 2025 14:49:19 -0400
From: Rich Felker <dalias@...c.org>
To: Nick Wellnhofer <wellnhofer@...um.de>
Cc: musl@...ts.openwall.com
Subject: Re: Collation, IDN, and Unicode normalization

On Thu, May 29, 2025 at 12:52:49PM -0400, Rich Felker wrote:
> On Thu, May 29, 2025 at 10:03:57AM -0400, Rich Felker wrote:
> > down to. If you take out the actual expansions, which are 9580 bytes,
> > it's a better comparison. Those could be packed tighter to get it down
> > further, but improving them is independent of the way we map to them.
> 
> OK, those can be cut down a lot.
> 
> The 9580 figure was from 2395 slots each 32-bit, each holding a 17-bit
> codepoint and 8-bit combining class.
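
(For concreteness, each such 32-bit slot amounts to something along these
lines; the macro names are purely illustrative, not anything that exists
in the tree:)

    #include <stdint.h>

    /* hypothetical packing: 8-bit combining class above a 17-bit codepoint */
    #define SLOT(cp,ccc)  ((uint32_t)(ccc)<<17 | ((uint32_t)(cp) & 0x1ffff))
    #define SLOT_CP(s)    ((s) & 0x1ffff)
    #define SLOT_CCC(s)   ((s)>>17 & 0xff)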
> 
> However, a quick analysis shows there are only 94 characters that ever
> appear in a non-initial position of an expansion. This means the
> actual expansions can be stored as a potentially larger initial
> member followed by 7 bits (plus 1 continuation-flag bit?) for each
> subsequent character in the expansion.
> 
> That requires a 94-entry table to map the possibilities to actual
> character numbers and combining classes, but the table can easily be
> 24-bit packed rather than 32-bit if we want, as 282 bytes, or just
> aligned 32-bit as 376 bytes.
> 
> There are 317 possible characters that appear in the first slot of an
> expansion, and some of these overlap, because there are only 406 total
> that appear in any position.
> 
> We could put all of those 406 in a table, with the 94 at the beginning
> so they can be referenced by a single byte. 
> 
> For non-starters which expand to themselves, we can have an additional
> 55 slots, one for each possible combining class, with a dummy
> character value meaning "expands to itself" and only the combining
> class value being used. That increases the number of entries to 461.
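
(So, schematically, the regions of that table would be something like the
following; the names are made up just for illustration:)

    /* hypothetical layout of the 461-entry character table */
    #define NONINITIAL_BASE  0    /* 0..93: chars that appear in non-initial position */
    #define OTHER_BASE       94   /* 94..405: remaining chars appearing in expansions */
    #define SELFMAP_BASE     406  /* 406..460: one dummy entry per combining class, */
                                  /* meaning "expands to itself" */
    #define NENTRIES         461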
> 
> That's 1383 or 1844 bytes depending on whether we pack to 3-byte
> entries.
> 
> Now, what about the references for the initial character of each
> expansion? Do we make them 2-byte? How about instead making all of
> the references a variable-length code? A byte x<128 represents
> itself; a byte x>=128 represents (x-128)*128+y, where y is the next
> byte. This future-proofs things, allowing the number of characters
> that could appear to be as high as 16384, and lets us use a single
> byte for a lot of the leading characters of expansions too.
> 
> Or, since we're not going to need anywhere near that many, we could
> reserve only a few bits for the variable-length coding. For example,
> if x<240 represents itself, all of the 94 non-initial decomposition
> characters and all of the 55 non-starter class codes fit in a single
> byte, and we can still represent everything else in the range of
> (x-240)*240+y, which reaches up to 3839 (over 8x the presently needed
> range).
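
(Purely as an illustration of the above scheme, decoding one such index
is only a few lines; the function name and calling convention here are
made up:)

    #include <stddef.h>

    /* Hypothetical decoder for the x<240 variable-length index scheme:
     * returns the number of bytes consumed and stores the decoded
     * index in *idx. */
    static size_t vlc_decode(const unsigned char *p, unsigned *idx)
    {
        if (p[0] < 240) {               /* 0..239: the byte is the index */
            *idx = p[0];
            return 1;
        }
        *idx = (p[0]-240)*240 + p[1];   /* 240..255: two-byte form */
        return 2;
    }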
> 
> Now, let's estimate sizes again:
> 
> If there are 2395 slots, 1046 of them initial, let's assume the
> initial ones take 2 bytes each (some can be less) in the expansion
> table and the rest take 1 byte each. That's 1349 + 2*1046 = 3441 bytes
> for expansions. Plus the above 1383 bytes for the table of possible
> characters, that gives 4824 bytes. Basically half of what we started at.
> 
> One final note: I've kinda handwaved being able to pack the table of
> possible characters into 3 bytes each rather than 4, since they're
> 17-bit characters with 8-bit combining classes. However, among the
> non-BMP characters, only 9 of the possible 55 classes ever appear, so
> we can just steal 9 of the unassigned slots to mean "this is one of
> those 9, plus set the high bit on the character value". That said,
> given that the difference in table size vs. using a 32-bit unpacked
> table is only 461 bytes, and that the unpacking code probably costs
> nearly that much in .text size, it's probably better to just forget
> that and use the 4-byte version at 1844 bytes. This makes the total
> size 5285 bytes, which is still incredibly small, and it also avoids
> implementing a packing we might have to rip out later if Unicode
> expands character assignment in ways that don't fit.

A small setback that, without other changes, makes throwing away that
packing rather costly:

While there are only a small number of characters that appear in
multi-character decompositions, I failed to account for
single-character decompositions ("singletons" in Unicode normalization
terminology). These bring the number of characters appearing in
decompositions (together with the nonstarter self-map entries) up to
1375. Most of these are the
characters only present for round-tripping legacy character sets, such
as Å vs "ANGSTROM SIGN", Ω vs "OHM SIGN", and of course all the "CJK
COMPATIBILITY IDEOGRAPH" codepoints.

For what it's worth, there are 3 odd compat characters:

0340;COMBINING GRAVE TONE MARK;Mn;230;NSM;0300;;;;N;NON-SPACING GRAVE TONE MARK;;;;
0341;COMBINING ACUTE TONE MARK;Mn;230;NSM;0301;;;;N;NON-SPACING ACUTE TONE MARK;;;;
0343;COMBINING GREEK KORONIS;Mn;230;NSM;0313;;;;N;;;;;

that decompose respectively to nonstarters:

0300;COMBINING GRAVE ACCENT;Mn;230;NSM;;;;;N;NON-SPACING GRAVE;;;;
0301;COMBINING ACUTE ACCENT;Mn;230;NSM;;;;;N;NON-SPACING ACUTE;;;;
0313;COMBINING COMMA ABOVE;Mn;230;NSM;;;;;N;NON-SPACING COMMA ABOVE;;;;

All the rest of the singletons decompose to starters (so no combining
class needs to be stored). Since the above 3 seem to have basically
just been a historical mistake, and the bulk of new singletons are CJK
COMPAT, I think it makes sense to optimize around the assumption that
most future-added singletons will be starters (and won't need a
combining class).

So..

It's really nasty to be spending 4 or even 3 bytes per entry in the
"characters that appear in decompositions" table on a character that
appears only once, in a single decomposition, because it was
accidentally encoded twice in some obscure legacy character set. This
is on top of needing an entry in, and a 2-byte reference out of, the
expansions table, not to mention increasing the number and size of
populated second-level tables.

One possibility is to allow coding characters directly in the
expansion table rather than as references into the table of commonly
appearing characters. The VLC so far is really only defined as:

    0..239: single-byte index
    240..255: head of a double-byte index

But since that whole range isn't needed, it could be expanded to:

    0..239: single-byte index
    240..251: head of a double-byte index
    252..255: top 2 bits of an immediate 2-byte character that follows

This would allow representing any character up to U+3FFFF (note: some
in the 2xxxx range appear as CJK COMPAT mappings), as long as it's a
starter, without needing an entry in the list of characters that
appear in decompositions, at a cost of only 3 bytes (vs. 2) in the
expansion table. That's a saving of 3 bytes per singleton (or 2, if we
packed the table entries down to 3 bytes each). It might help reduce
the size of some existing non-singleton decompositions too.
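
(To illustrate, a decoder for the extended code could look roughly like
the following, building on the earlier sketch; the name, calling
convention, and double-byte formula are hypothetical, just following the
ranges listed above:)

    #include <stddef.h>

    /* Decode one element of an expansion: returns bytes consumed.
     * *val is either an index into the table of characters appearing
     * in decompositions, or, when *inlined is set, a literal codepoint
     * up to U+3FFFF. Hypothetical sketch. */
    static size_t expansion_decode(const unsigned char *p,
                                   unsigned *val, int *inlined)
    {
        *inlined = 0;
        if (p[0] < 240) {              /* single-byte index */
            *val = p[0];
            return 1;
        }
        if (p[0] < 252) {              /* head of a double-byte index */
            *val = (p[0]-240)*240 + p[1];
            return 2;
        }
        /* 252..255: top 2 bits of an immediate character; 2 bytes follow */
        *inlined = 1;
        *val = (unsigned)(p[0]-252)<<16 | p[1]<<8 | p[2];
        return 3;
    }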

Indeed, only 135 characters appear 3 or more times in decompositions.

Including singletons, 1054 appear only once.

Excluding singletons, still 231 appear only once. (These cost more to
put in a table than to inline in the expansions.)
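
(In case anyone wants to poke at the numbers: counts of this sort can be
approximated straight from UnicodeData.txt with a throwaway program along
these lines. It's not the tool these figures came from, and since it only
looks at raw, non-recursive canonical decompositions it won't reproduce
them exactly, but it shows the kind of analysis involved:)

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    static unsigned count[0x110000];

    int main(void)
    {
        char line[512], *f[16];
        while (fgets(line, sizeof line, stdin)) {
            /* split into ;-separated fields, keeping empty ones */
            int n = 0;
            f[n++] = line;
            for (char *p = line; n < 16 && (p = strchr(p, ';')); )
                *p++ = 0, f[n++] = p;
            /* field 5 is the decomposition; skip <tagged> compat ones */
            if (n < 6 || !f[5][0] || f[5][0] == '<') continue;
            for (char *p = f[5]; *p; ) {
                unsigned long cp = strtoul(p, &p, 16);
                if (cp < 0x110000) count[cp]++;
                while (*p == ' ') p++;
            }
        }
        unsigned long once = 0, thrice = 0;
        for (unsigned long cp = 0; cp < 0x110000; cp++) {
            if (count[cp] == 1) once++;
            if (count[cp] >= 3) thrice++;
        }
        printf("once: %lu, 3+ times: %lu\n", once, thrice);
        return 0;
    }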

If we only include in the table characters which either appear at
least 3 times or are non-starters, double-byte indices wouldn't get
used at all. Each expansion element would then be either a single-byte
index or an inline character value.

If we don't actually need double-byte indices, we could in theory
recover the values used for them to encode inline characters more
efficiently. But I don't think it really helps that much. The
locations where individual characters aren't common in decompositions,
but where characters with matching upper bits are common, are fairly
sporadic, spread over the 00, 04, 05, 0F, 22, and 30 blocks and the
entire CJK ideographs range. In theory we could pick, say, the 16 most
commonly appearing blocks and encode those in the lead byte, but it
would save a few hundred bytes total at most, and it's gratuitously
ugly and dependent on either manually picking the best blocks or
marshalling results of the optimization logic into code. I think this
is a bad direction to pursue.

