Message-ID: <20250528064742.GS1827@brightrain.aerifal.cx>
Date: Wed, 28 May 2025 02:47:43 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Re: Collation, IDN, and Unicode normalization

On Mon, May 05, 2025 at 02:18:49PM -0400, Rich Felker wrote:
> One aspect of LC_COLLATE support is that collation is supposed to be
> invariant under canonical equivalence/different normalization forms,
> while collation rules are best expressed in terms of NFD.
> 
> The most direct simple way to apply collation rules is to transform
> into NFD (on the fly) as they're applied. A more time-efficient and
> code-simplifying way is to apply a "canonical closure" (defined in
> UTN#5) to the collation rules ahead of time. The cost is making the
> collation tables larger (how much larger is something I still need to
> quantify), but without using this approach, there is a table size cost
> (as well as code and design for making this table efficient) to be
> able to compute decompositions on the fly.

It turns out a good deal of the decomposition/normalization logic is
needed even if you do the canonical closure approach, since you still
need to deal with non-starters (characters with canonical combining
class != 0) which may be reordered under normalization. For example,
given:

00C8 = 0045 0300 (E-grave)
1E1A = 0045 0330 (E-tilde_below)

00C8 0330 is equivalent to 1E1A 0300, and both must be treated as 0045
0330 0300 for the purpose of applying the collation rules (because
0330 has lower canonical combining class than 0300).
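As a toy illustration of that (not proposed musl code; the decomp() and
ccc() functions below are hypothetical two-entry stand-ins for the real
tables), the following decomposes into a buffer and applies canonical
ordering, and prints 0045 0330 0300 for both input sequences:

#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

static size_t decomp(uint32_t c, uint32_t *out)
{
	if (c == 0x00C8) { out[0]=0x0045; out[1]=0x0300; return 2; }
	if (c == 0x1E1A) { out[0]=0x0045; out[1]=0x0330; return 2; }
	out[0] = c; return 1;
}
static int ccc(uint32_t c)
{
	return c==0x0300 ? 230 : c==0x0330 ? 220 : 0;
}

int main(void)
{
	uint32_t in[2][2] = { {0x00C8,0x0330}, {0x1E1A,0x0300} }, buf[8];
	for (int s = 0; s < 2; s++) {
		size_t n = 0;
		for (int i = 0; i < 2; i++) n += decomp(in[s][i], buf+n);
		/* canonical ordering: stable-sort non-starters by ccc */
		for (size_t i = 1; i < n; i++)
			for (size_t j = i; j > 0 && ccc(buf[j]) &&
			     ccc(buf[j-1]) > ccc(buf[j]); j--) {
				uint32_t t = buf[j];
				buf[j] = buf[j-1];
				buf[j-1] = t;
			}
		for (size_t i = 0; i < n; i++)
			printf("%04X ", (unsigned)buf[i]);
		printf("\n"); /* both iterations print: 0045 0330 0300 */
	}
	return 0;
}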

One could imagine expanding out a closure for all possible sequences
that could reorder to a given collation element, but this quickly
blows up with all the non-meaningful-but-allowed sequences of
combining marks, and AFAICT the set isn't even finite. This means the
implementation always needs, as a fallback, an iterator that runs over
a sequence consisting of a starter followed by some number of
non-starters and reads off the decomposed characters in order of
canonical combining class, without requiring working space to hold the
decoded sequence. This will be an important ingredient of the
implementation, and it's a bit tricky to do right, but tractable.
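
For the "no working space" part, here's a rough standalone sketch of
what such an iterator could look like (the interface and the toy
decomp()/ccc() tables are placeholders, and it assumes the run
decomposes to a single leading starter, i.e. Hangul would be handled
separately): the only state kept is the (ccc, position) of the last
character emitted, and each call rescans the run to select the next
one.

#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

/* toy stand-ins for the real decomposition and ccc tables */
static size_t decomp(uint32_t c, uint32_t out[4])
{
	if (c == 0x00C8) { out[0]=0x0045; out[1]=0x0300; return 2; }
	if (c == 0x1E1A) { out[0]=0x0045; out[1]=0x0330; return 2; }
	out[0] = c; return 1;
}
static int ccc(uint32_t c)
{
	return c==0x0300 ? 230 : c==0x0330 ? 220 : 0;
}

struct nfd_iter {
	const uint32_t *seq; /* one starter followed by non-starters */
	size_t n;
	int last_ccc;        /* ccc of the last character emitted */
	long last_idx;       /* flat index of the last character emitted */
};

/* Return the next character of the run's NFD, or 0 when done.  Each
 * call rescans the run, decomposing on the fly, and picks the lowest
 * (ccc, position) pair not yet emitted: O(n) work per character,
 * O(1) iterator state, no decode buffer. */
static uint32_t nfd_next(struct nfd_iter *it)
{
	uint32_t best = 0, d[4];
	int best_ccc = 0;
	long best_idx = -1, idx = 0;
	for (size_t i = 0; i < it->n; i++) {
		size_t k = decomp(it->seq[i], d);
		for (size_t j = 0; j < k; j++, idx++) {
			int c = ccc(d[j]);
			if (c < it->last_ccc ||
			    (c == it->last_ccc && idx <= it->last_idx))
				continue; /* already emitted */
			if (best_idx < 0 || c < best_ccc ||
			    (c == best_ccc && idx < best_idx)) {
				best = d[j]; best_ccc = c; best_idx = idx;
			}
		}
	}
	if (best_idx < 0) return 0;
	it->last_ccc = best_ccc;
	it->last_idx = best_idx;
	return best;
}

int main(void)
{
	uint32_t run[] = { 0x00C8, 0x0330 };
	struct nfd_iter it = { run, 2, 0, -1 };
	for (uint32_t c; (c = nfd_next(&it)); )
		printf("%04X ", (unsigned)c);
	printf("\n"); /* prints: 0045 0330 0300 */
	return 0;
}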

So my conclusion is that the "canonical closure" approach is really a
performance optimization, trading table size for speed, and doesn't
offer much or anything in the way of simplicity. I think this means we
can and should forget about it.


> Separately (and not part of the locale overhaul project), IDN support
> requires the capability to perform normalization into NFKC -- maybe
> not for all of Unicode, but at least for the characters that could
> appear in domain names. So in theory there is possibly some value to
> trying to share the [de]composition tables and use them in both
> directions.

At least the above iterator can very much be shared here.

> I know for a very old version of Unicode supported in my uuterm,
> decomposition tables and code fit in under 8k.
> 
> I'm guessing the canonical closure for the collation data will be
> a lot larger than that, even if Hangul could be special-cased and
> elided. But depending on what level of collation capability we want
> internal to libc, independent of having a locale definition loaded
> (which would be fully-shared mmapped), this size might mainly be in
> locale files on disk, as opposed to decomposition tables which would
> be linked into libc.
> 
> I'll be trying to work out some quantitative data on the tradeoffs
> here, but wanted to go ahead and put the topic out there, especially
> since the IDN topic has come up on IRC again recently and coming up
> with a good choice here might intersect with IDN stuff.

The Default Unicode Collation Element Table (DUCET) contains nearly
40k explicit mappings, on top of implicit ones. Under 5000 of those
are mappings for precomposed characters. For the secondary and
tertiary weights, there are patterns that should collapse them down a
lot, maybe to as little as a fraction of a bit per entry. But the
primary weights are generally not assigned systematically, at least
not for most of the real-world, living-language scripts in the BMP, so
they take at least 16 bits each for some 16k characters, and probably
close to that for the remaining ~24k. That likely comes to 70-100k of
data, which is probably way above what we'd want to bake into libc.
(For example, a static-linked ls command would pull in all of it,
since it sorts output with strcoll/strxfrm.)
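
(Rough arithmetic behind that range: 16k characters at 16 bits each is
about 32k, and another ~24k at close to that is up to ~48k more, so
the primary weights alone land somewhere around 70-80k before counting
the compressed secondary/tertiary weights or whatever structure
indexes them.)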

So I think this points to having no baked-in default collation table
(e.g. for locales to delegate to, or for some variant of the C.UTF-8
locale to use) as the right thing to do. In a way this is good policy:
it leaves the collation order entirely up to the locale to specify,
incorporating the Unicode defaults (DUCET) or not, rather than
artificially favoring locales that build on it or requiring them to do
so.


So, next steps:

1. I know we need the above-described iterator, and it's something
concrete that can be written and tested independent of anything else.
So I'm going to begin working on it, and post a standalone
proof-of-concept once I have something working.

2. The output of that iterator can serve as the input to a prefix
tree, which would be the data structure for the collation table. This
needs some further checking, but I think the well-formedness
conditions for collation element tables (UTS#10 section 5) ensure that
longest-matched-prefix is a valid way to select the rule to apply.
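
As a rough sanity check of item 2 (purely hypothetical interface and
weights, and a real table would be a flat, mmap-able structure rather
than pointer-linked nodes), longest-matched-prefix lookup over
already-decomposed input could look like:

#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

struct coll_node {
	uint32_t c;               /* codepoint on the edge into this node */
	const uint32_t *weights;  /* collation element, or NULL if none */
	size_t nweights;
	struct coll_node *child, *sibling;
};

/* Follow the edge labeled c out of node n, or return NULL. */
static struct coll_node *step(struct coll_node *n, uint32_t c)
{
	for (n = n->child; n; n = n->sibling)
		if (n->c == c) return n;
	return 0;
}

/* Walk as far as the input allows, remembering the last node that
 * carries a mapping: that node is the longest-matched-prefix rule.
 * Stores the number of input characters consumed in *consumed. */
static struct coll_node *lookup(struct coll_node *root,
                                const uint32_t *in, size_t n,
                                size_t *consumed)
{
	struct coll_node *cur = root, *best = 0;
	size_t i = 0;
	*consumed = 0;
	while (i < n && (cur = step(cur, in[i]))) {
		i++;
		if (cur->weights) { best = cur; *consumed = i; }
	}
	return best;
}

int main(void)
{
	/* tiny table: a rule for 0045, plus a contraction 0045 0330 */
	static const uint32_t w_e[] = { 0x1234 };    /* made-up weights */
	static const uint32_t w_e330[] = { 0x1235 };
	static struct coll_node n_e330 = { 0x0330, w_e330, 1, 0, 0 };
	static struct coll_node n_e = { 0x0045, w_e, 1, &n_e330, 0 };
	static struct coll_node root = { 0, 0, 0, &n_e, 0 };

	/* decomposed input as the iterator would yield it */
	uint32_t in[] = { 0x0045, 0x0330, 0x0300 };
	size_t used;
	struct coll_node *m = lookup(&root, in, 3, &used);
	/* prints "matched 2 chars, weight 1235": the contraction wins
	 * over the single-character rule for 0045 alone */
	printf("matched %zu chars, weight %04X\n",
	       used, m ? (unsigned)m->weights[0] : 0u);
	return 0;
}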
