Message-ID: <20250529002951.GU1827@brightrain.aerifal.cx>
Date: Wed, 28 May 2025 20:29:53 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Re: Collation, IDN, and Unicode normalization

On Wed, May 28, 2025 at 02:47:43AM -0400, Rich Felker wrote:
> On Mon, May 05, 2025 at 02:18:49PM -0400, Rich Felker wrote:
> > One aspect of LC_COLLATE support is that collation is supposed to be
> > invariant under canonical equivalence/different normalization forms,
> > while collation rules are best expressed in terms of NFD.
> > 
> > The most direct simple way to apply collation rules is to transform
> > into NFD (on the fly) as they're applied. A more time-efficient and
> > code-simplifying way is to apply a "canonical closure" (defined in
> > UTN#5) to the collation rules ahead of time. The cost is making the
> > collation tables larger (how much larger is something I still need to
> > quantify), but without using this approach, there is a table size cost
> > (as well as code and design for making this table efficient) to be
> > able to compute decompositions on the fly.
> 
> It turns out a good deal of the decomposition/normalization logic is
> needed even if you do the canonical closure approach, since you still
> need to deal with non-starters (characters with canonical combining
> class != 0) which may be reordered under normalization. For example,
> given:
> 
> 00C8 = 0045 0300 (E-grave)
> 1E1A = 0045 0330 (E-tilde_below)
> 
> 00C8 0330 is equivalent to 1E1A 0300, and both must be treated as 0045
> 0330 0300 for the purpose of applying the collation rules (because
> 0330 has lower canonical combining class than 0300).
> 
> One could imagine expanding out a closure for all possible sequences
> that could reorder to a given collation element, but this quickly
> blows up with all the non-meaningful-but-allowed sequences of
> combining marks, and AFAICT the set isn't even finite. This means it
> always requires some sort of fallback to an iterator that runs over
> a sequence of a starter followed by a number of non-starters and reads
> off the decomposed characters in order of canonical combining class,
> without actually having working space to decode them. This will be an
> important ingredient of the implementation, and it's a bit tricky to
> do right, but tractable.

OK, so I've worked out the iterator system and it's not all that bad.
I have untested code, but it depends on a backend for the
decomposition & canonical combining class tables which I haven't done
yet.
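
To give an idea of the shape, here's one way just the reordering part
could look (a sketch only, assuming a hypothetical ccc_of() backend
and input that is already fully decomposed; the real iterator also has
to pull decompositions out on the fly, which is what needs the
tables):

#include <stdio.h>
#include <stddef.h>

/* Stand-in for the not-yet-written table backend: canonical combining
 * class of an already-decomposed character, 0 for starters. Only the
 * two marks from the example earlier in the thread are covered. */
static unsigned ccc_of(unsigned c)
{
	switch (c) {
	case 0x300: return 230; /* combining grave accent */
	case 0x330: return 220; /* combining tilde below */
	default: return 0;
	}
}

/* Emit s[0..n-1] (a starter followed by non-starters, already fully
 * decomposed) in canonical order, i.e. stably sorted by combining
 * class, with no buffer proportional to n: repeatedly scan for the
 * next-lowest class and emit everything of that class in input order. */
static void emit_canonical_order(const unsigned *s, size_t n)
{
	unsigned cur = 0;
	for (;;) {
		unsigned next = 256;
		for (size_t i = 0; i < n; i++) {
			unsigned cl = ccc_of(s[i]);
			if (cl == cur) printf("%04X ", s[i]);
			else if (cl > cur && cl < next) next = cl;
		}
		if (next == 256) break;
		cur = next;
	}
}

int main(void)
{
	/* 00C8 0330 decomposed in input order: 0045 0300 0330 */
	unsigned seq[] = { 0x45, 0x300, 0x330 };
	emit_canonical_order(seq, 3);
	putchar('\n');	/* prints: 0045 0330 0300 */
	return 0;
}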

Regarding that table backend, I have a rather unconventional idea for
combining data tables that I'd like to put forward.

First, some numbers.

There are 1046 characters in current Unicode that decompose, most of
them just to a pair of characters, but a handful to 3 or 4. (This is
not counting precomposed Hangul syllables, which decompose
algorithmically into jamo and don't need any tables.)

There are 934 characters with nonzero canonical combining class. These
are called "non-starters". Each has an 8-bit combining class code
associated with it. Of the 255 possible nonzero values, only 55
actually appear.

The (pardon the pun) canonical way to represent these would be to have
a multilevel table "is_decomposing" to identify characters that
decompose, and a sorted list of decompositions to binary-search when
one is hit, plus a similar multilevel table for combining classes,
except mapping to the class number, much as the case-mapping table
maps to a smallish set of case-mapping rules.
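
As a toy sketch of the first half of that (the predicate table), in
the same style as the isw*() tables (names and data here are made up,
covering only U+00C8 from the example, just to show the shape of the
lookup):

#include <stdio.h>
#include <stdint.h>

static const unsigned char blockidx[256] = { [0x00] = 1 }; /* BMP blocks */
static const uint32_t bits[][8] = {
	{ 0 },               /* shared all-zero block */
	{ [6] = 1u << 8 },   /* bit for U+00C8 within block 0x00 */
};

static int is_decomposing(unsigned c)
{
	if (c >= 0x10000) return 0; /* toy: BMP only */
	unsigned b = blockidx[c >> 8];
	return bits[b][(c & 255) >> 5] >> (c & 31) & 1;
}

int main(void)
{
	printf("%d %d\n", is_decomposing(0xC8), is_decomposing(0xC9)); /* 1 0 */
	return 0;
}

On a hit you'd then binary-search the sorted decomposition list, and a
second table of the same shape (mapping to 8-bit values rather than
one bit) would give the combining classes.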

However, while decomposable characters and non-starters are separate
things, for practical purposes, they go hand-in-hand. They are the
characters that need special treatment when normalizing, and you have
to process both without knowing ahead of time which you're looking at.

So, what I think may make sense is combining them into a single table
that maps "character needing normalization processing" to a sequence
(possibly length-1, if the character is a non-starter rather than an
actual decomposable character) of *pairs* of the form [character,
canonical combining class]. That way, you get out the canonical
combining classes you need at the same time as doing the
decomposition, rather than needing a separate step to look them up for
each character that comes out of the decomposition.
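
For concreteness, the entries for the characters from the example
earlier in the thread could come out something like this (a sketch
with made-up structure names; the classes 230 and 220 for 0300 and
0330 are the real UnicodeData values):

#include <stdio.h>
#include <stddef.h>

struct pair { unsigned ch; unsigned char ccc; };
struct norm_entry { unsigned ch; unsigned char len; struct pair seq[4]; };

static const struct norm_entry examples[] = {
	{ 0x00C8, 2, { { 0x0045, 0 }, { 0x0300, 230 } } }, /* E-grave */
	{ 0x1E1A, 2, { { 0x0045, 0 }, { 0x0330, 220 } } }, /* E-tilde_below */
	{ 0x0300, 1, { { 0x0300, 230 } } },                /* lone non-starter */
};

int main(void)
{
	for (size_t i = 0; i < sizeof examples / sizeof examples[0]; i++) {
		printf("%04X ->", examples[i].ch);
		for (int j = 0; j < examples[i].len; j++)
			printf(" [%04X %d]", examples[i].seq[j].ch,
			       examples[i].seq[j].ccc);
		putchar('\n');
	}
	return 0;
}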

The cost of storing the extra bits for the combining class is
virtually nothing, since (1) it saves having a separate parallel data
structure, and (2) Unicode characters are 21-bit (17-bit in practice
for the relevant range), leaving well over 8 bits that are just wasted
unless you do costly-to-unpack packings.

Basically, the motivation is this: the bits are there unused, so we
might as well use them to store the data we're going to need right
afterwards anyway, and in that case, it makes sense to use the same
representation for non-starters.
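
As a sketch of the kind of thing I mean, with an arbitrary choice of
bit positions (names made up):

#include <stdio.h>
#include <stdint.h>

/* 21 bits (17 in practice) for the character, 8 for the combining
 * class, in one 32-bit unit; unpacking is a shift and a mask. */
#define NORM_PAIR(ch, ccc)  ((uint32_t)(ch) | (uint32_t)(ccc) << 21)
#define NORM_CHAR(p)        ((p) & 0x1fffff)
#define NORM_CCC(p)         ((p) >> 21 & 0xff)

int main(void)
{
	uint32_t p = NORM_PAIR(0x0330, 220);
	printf("%04X %u\n", (unsigned)NORM_CHAR(p), (unsigned)NORM_CCC(p));
	return 0;
}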

I'm not sure what the ideal format for the actual table is. I was
initially looking at a multi-level bit table to evaluate the predicate
"needs normalization processing" (same as the isw*() tables in musl
now), then binary-searching a sorted list. However, a direct
multi-level table mapping to the expansions might be an option as
well. There are currently 78 256-character-aligned blocks that have
relevant characters in them. This puts a bit table at about 2.5k and
a full map at about 20k, versus something like 7-10k (depending on how
nasty the encoding is) for the binary-searchable list of 1980 slots,
each holding a 17-bit character number and an offset to its expansion.
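
For that last option, here's a rough sketch of one possible slot
encoding and the lookup, with toy data standing in for the generated
tables; the 15-bit offset field is just one arbitrary way of keeping
slots at 4 bytes, which puts 1980 of them a bit under 8k before the
expansion pool:

#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

#define SLOT(ch, off)  ((uint32_t)(ch) | (uint32_t)(off) << 17)
#define SLOT_CHAR(s)   ((s) & 0x1ffff)
#define SLOT_OFF(s)    ((s) >> 17)

static const uint32_t slots[] = {  /* sorted by character; toy offsets */
	SLOT(0x00C8, 0),
	SLOT(0x0300, 4),
	SLOT(0x1E1A, 6),
};

/* Binary search; returns the expansion offset, or -1 if the character
 * needs no processing (with the bit-table variant in front, this only
 * runs on a hit, so the -1 case never happens). */
static long norm_lookup(unsigned c)
{
	size_t lo = 0, hi = sizeof slots / sizeof slots[0];
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		unsigned mc = SLOT_CHAR(slots[mid]);
		if (mc == c) return SLOT_OFF(slots[mid]);
		if (mc < c) lo = mid + 1;
		else hi = mid;
	}
	return -1;
}

int main(void)
{
	printf("%ld %ld\n", norm_lookup(0x1E1A), norm_lookup(0x41)); /* 6 -1 */
	return 0;
}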
