musl - Re: Collation, IDN, and Unicode normalization

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20250529165249.GB1827@brightrain.aerifal.cx>
Date: Thu, 29 May 2025 12:52:49 -0400
From: Rich Felker <dalias@...c.org>
To: Nick Wellnhofer <wellnhofer@...um.de>
Cc: musl@...ts.openwall.com
Subject: Re: Collation, IDN, and Unicode normalization

On Thu, May 29, 2025 at 10:03:57AM -0400, Rich Felker wrote:
> down to. If you take out the actual expansions, which are 9580 bytes,
> it's a better comparison. Those could be packed tighter to get it down
> further, but improving them is independent of the way we map to them.

OK, those can be cut down a lot.

The 9580 figure was from 2395 slots each 32-bit, each holding a 17-bit
codepoint and 8-bit combining class.

However, a quick analysis shows there are only 94 characters that ever
appear as the non-initial position of an expansion. This means the
actual expansions can be a series of a potentially larger initial
member followed by 7 bits (plus 1 continuation-flag bit?) for each
subsequent character in the expansion.

That requires a 94-entry table to map the possibilities to actual
character numbers and combining classes, but the table can easily be
24-bit packed rather than 32-bit if we want, as 282 bytes, or just
aligned 32-bit as 376 bytes.

There are 317 possible characters that appear in the first slot of an
expansion, and some of these overlap, because there are only 406 total
that appear in any position.

We could put all of those 406 in a table, with the 94 at the beginning
so they can be referenced by a single byte. 

For non-starters which expand to themselves, we can have an additional
55 slots, one for each possible combining class, with a dummy
character value meaning "expands to itself" and only the combining
class value being used. That increases the number of entries to 461.

That's 1383 or 1844 bytes depending on whether we pack to 3-byte
entries.

Now, what about the reference for the initial character of each
expansion? Do we make them 2-byte? How about instead, making all of
the references a variable-length code. A byte x<128 represents itself;
a byte x>=128 represents (x-128)*128+y where y is the next byte. This
future-proofs things allowing the number of characters that could
appear to be as high at 16384, and lets us use a single-byte for a lot
of the leading characters of expansions too.

Or, since we're not going to need anywhere near that many, we could
reserve only a few bits for the variable-length codind. For example,
if x<240 represents itself, all of the 94 non-initial decomposition
characters and all of the 55 non-starter class codes fit in a single
byte, and we can still represent everything else in the range of
(x-240)*240+y which reaches up to 3839 (over 8x the presently needed
range).

Now, let's estimate sizes again:

If there are 2395 slots, 1046 of them initial, let's assume the
initial ones take 2 bytes each (some can be less) in the expansion
table and the rest take 1 byte each. That's 1349 + 2*1046 = 3441 bytes
for expansions. Plus the above 1383 bytes for the table of possible
character, gives 4824 bytes. Basically half of what we started at.

One final note: I've kinda handwaved being able to pack the table of
possible characters into 3 bytes each rather than 4, since they're
17-bit characters with 8-bit combining classes. However, in the
non-BMP characters, only 9 of the possible 55 classes ever appear, so
we can just steal 9 of the unassigned slots to use as "this means one
of those 9, plus set the high bit on the character value". However,
given that the difference in table size vs using a 32-bit unpacked
table is only 461 bytes, and that the unpacking code probably costs
nearly that much in .text size, it's probably better to just forget
that and use the 4-byte version at 1844 bytes. This makes the total
size 5285 bytes, which is still incredibly small, and it also avoids
implementing a packing we might have to rip out later if Unicode
expands character assignment in ways that don't fit.
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.