Message-ID: <20250529023741.GW1827@brightrain.aerifal.cx>
Date: Wed, 28 May 2025 22:37:41 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Re: Collation, IDN, and Unicode normalization

On Wed, May 28, 2025 at 08:29:53PM -0400, Rich Felker wrote:
> So, what I think may make sense is combining them into a single table,
> that maps "character needing normalization processing" to a sequence
> (possibly length-1, if this is a non-starter, not an actual
> decomposable character) of *pairs* of the form [character, canonical
> combining class]. That way, you get out the canonical combining
> classes you need at the same time as doing the decomposition, rather
> than needing a separate step to look it up for each character that
> comes out of the decomposition.
> 
> The cost of storing the extra bits for the combining class is
> virtually nothing, since (1) it saves having a separate parallel data
> structure, and (2) Unicode characters are 21-bit (17-bit in practice
> for the relevant range), leaving well over 8 bits that are just wasted
> unless you do costly-to-unpack packings.
> 
> Basically, the motivation is this: the bits are there unused, so we
> might as well use them to store the data we're going to need right
> afterwards anyway, and in that case, it makes sense to use the same
> representation for non-starters.
> 
> I'm not sure what the ideal format for the actual table is. I was
> initially looking at a multi-level bit table to evaluate the predicate
> "needs normalization processing" (same as the isw*() tables in musl
> now), then binary-searching a sorted list. However, a direct
> multi-level table mapping to the expansions might be an option as
> well. There are currently 78 256-character-aligned blocks that have
> relevant characters in them. This puts a bit table at about 2.5k, and
> a full map at about 20k. Vs something like 7-10k (depending on how
> nasty the encoding is) for the binary-searchable list of 1980 slots
> each holding a 17-bit character number and an offset to the expansion.

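Concretely, the packed slot I have in mind is something like this
(untested sketch; the exact bit layout is of course up for grabs):

#include <stdint.h>

/* One expansion slot: 21-bit codepoint in the low bits, 8-bit
 * canonical combining class above it, 3 bits to spare. */
#define SLOT(cp,ccc)  ((uint32_t)(cp) | (uint32_t)(ccc)<<21)
#define SLOT_CP(s)    ((s) & 0x1fffff)
#define SLOT_CCC(s)   ((s)>>21 & 0xff)
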
Some more numbers:

For just the real decompositions, not including non-starters, there
are only 31 blocks (~8k of multi-level table) rather than 78. That
makes it sound appealing not to combine them. However, that leaves 68
blocks with non-starters in them which also have to be processed in
some other way. I'm pretty sure there are fewer than 16 classes per
block, so we could use a 4-bit table that would get it down to about
9k. But that's still nearly as large as the combined table (like 17k
vs 20k).
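
For reference, the per-block 4-bit lookup would be something like this
(names hypothetical; blkccc[] maps the 4-bit index back to a real
combining class, assuming <16 distinct classes per block):

/* tab[] is 128 packed bytes, two 4-bit entries per byte, covering one
 * 256-codepoint block; lo is codepoint&255. */
static int block_ccc(const unsigned char tab[128],
                     const unsigned char blkccc[16], unsigned lo)
{
	return blkccc[tab[lo>>1] >> (lo&1)*4 & 15];
}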

Alternatively, we could use a binary search. At 934 entries, that's 10
steps, which is rather costly. One nice thing is that we can easily
pre-filter for which characters might be non-starters using wcwidth
plus a few exceptions, but it still leaves processing text with any of
these combining marks inordinately expensive.
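
In code, the pre-filtered search is roughly this (sketch, reusing the
SLOT accessors above; nonstarters[] is a hypothetical sorted table of
packed slots, and the commented-out is_mc_exception() stands in for
the wcwidth exceptions noted below):

#include <stdlib.h>
#include <wchar.h>

extern const uint32_t nonstarters[934]; /* sorted by codepoint */

static int slot_cmp(const void *k, const void *e)
{
	uint32_t a = *(const uint32_t *)k, b = SLOT_CP(*(const uint32_t *)e);
	return (a > b) - (a < b);
}

static int get_ccc(wchar_t c)
{
	if (wcwidth(c) != 0 /* && !is_mc_exception(c) */) return 0;
	uint32_t key = c;
	const uint32_t *s = bsearch(&key, nonstarters, 934,
		sizeof *nonstarters, slot_cmp);
	return s ? SLOT_CCC(*s) : 0;
}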

(Aside: All of the exceptions have general category Mc, some of them
changed from Mn in previous versions of Unicode, and it's probably an
error that our current policy for wcwidth is not classifying them as
nonspacing. This needs to be reviewed at some point.)

One hybrid option that might sound like a reasonable compromise is
using the block number (codepoint>>8) to index a list of binary search
endpoints, so that the list to search is small. However, the common
combining characters are all grouped together in blocks 03, 1D, 2D,
and a few others, so performance for these would stay pretty bad
(roughly a 6-step search).
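
That would look something like this (hypothetical; bounds[] gives the
per-block endpoints into the sorted nonstarters[] list):

extern const uint16_t bounds[513]; /* one per block, plus terminator */

static int hybrid_ccc(uint32_t cp)
{
	unsigned lo = bounds[cp>>8], hi = bounds[(cp>>8)+1];
	while (lo < hi) {
		unsigned m = lo + (hi-lo)/2;
		uint32_t s = nonstarters[m];
		if (SLOT_CP(s) == cp) return SLOT_CCC(s);
		if (SLOT_CP(s) < cp) lo = m+1;
		else hi = m;
	}
	return 0;
}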

Overall, none of these options are sounding like a significant win
over the fast option (the direct multi-level table). OK, so let's see
how much space that really takes:


Top-level table (indexed by codepoint>>8) to select a table: 1 byte
per entry, for 512 bytes.

Second-level tables (indexed by codepoint&255): each one needs a few
header bytes to indicate where its data in the expansion table starts,
plus one byte per entry that's either the offset in the expansion
table or a 0 for "no expansion". Essentially this is something like
260 bytes per block, times 78 blocks, for 20280 bytes total.

Expansion table slots (indexed by second-level table): each slot is 4
bytes. 934 single-entry slots (non-starters) for 3736 bytes. 1046
decomposition entries, each 2-4 slots. 36 are 4-slot, 231 are 3-slot,
and 779 2-slot. That's 2395 slots, for 9580 bytes.

Total across all of these (512 + 20280 + 3736 + 9580) is 34108 bytes.
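
For clarity, the lookup through these tables has roughly this shape
(all names and the exact header encoding are hypothetical; how the
length/termination of an expansion is represented, e.g. flagged in
the 3 spare bits of each slot, is not settled here):

extern const unsigned char top[512];        /* block number -> table index */
extern const unsigned char *const blocks[]; /* 4 header bytes + 256 offsets */
extern const uint32_t expansions[];         /* packed [cp,ccc] slots */

static const uint32_t *lookup(uint32_t cp)
{
	const unsigned char *b = blocks[top[cp>>8]];
	unsigned base = b[0] | b[1]<<8;   /* start of this block's data */
	unsigned off = b[4 + (cp & 255)]; /* b[2], b[3] reserved here */
	if (!off) return 0;               /* 0 = no expansion */
	return expansions + base + off;
}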


One significant optimization I think I could make, without a cost to
performance or complexity, is using the second-level table header
bytes to encode the subrange of the block that has relevant codepoints
in it, so that the rest can be elided. I don't have the numbers yet
but I think that would cut off at least 50% of the second-level table
size, maybe more. There are some blocks with only a handful of
(usually consecutive or close) codepoints in them, and almost none
using more than half the range.

This does make it so that second-level tables no longer start at
multiple-of-table-size offsets (since they're variable size), so the
top-level table needs to be increased to 16-bit so it can store
offsets rather than indices. But that's really no big deal. It becomes
1k instead of 512 bytes.
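
With that change, the lookup becomes something like this (again with
expansions[] as above and all encoding details hypothetical):

extern const uint16_t top16[512];  /* byte offsets into blob[] */
extern const unsigned char blob[]; /* concatenated second-level tables */

static const uint32_t *lookup2(uint32_t cp)
{
	const unsigned char *b = blob + top16[cp>>8];
	unsigned base = b[0] | b[1]<<8;      /* start in expansion table */
	unsigned first = b[2], last = b[3];  /* subrange actually stored */
	unsigned lo = cp & 255;
	if (lo-first > last-first) return 0; /* outside stored range */
	unsigned off = b[4 + (lo-first)];
	return off ? expansions + base + off : 0;
}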


With this change, hopefully the data can be under 20k in total.

To be clear, this is size that goes in libc.so/libc.a, but doesn't
appear in static-linked programs unless they call strcoll or strxfrm.
Eventually it may also appear for IDN in static-linked programs that
use the stub resolver.

I'll try to work out a draft of the table gen code in the next few
days.
