Message-ID: <20250505181848.GD1827@brightrain.aerifal.cx>
Date: Mon, 5 May 2025 14:18:49 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Collation, IDN, and Unicode normalization

One aspect of LC_COLLATE support is that collation is supposed to be
invariant under canonical equivalence/different normalization forms,
while collation rules are best expressed in terms of NFD.

The simplest, most direct way to apply collation rules is to transform
the input into NFD on the fly as the rules are applied. A faster
approach that also simplifies the code is to apply a "canonical
closure" (defined in UTN#5) to the collation rules ahead of time. The
cost is larger collation tables (how much larger is something I still
need to quantify). Without the closure, though, there is still a table
size cost, plus the code and design work to make that table efficient,
in order to compute decompositions on the fly.
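
To make the on-the-fly option concrete, here is a minimal C sketch of
canonical decomposition followed by canonical reordering. It is not
musl code: the names toy_decomp, toy_ccc, toy_nfd and toy_canon_equal
are hypothetical, and the five-entry table is a hand-picked excerpt
standing in for the real Unicode data.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy excerpt of canonical decomposition data (an assumption for
 * illustration only); b == 0 marks a singleton decomposition. */
static const struct { uint32_t c, a, b; } toy_decomp[] = {
    { 0x00C5, 0x0041, 0x030A }, /* A with ring above */
    { 0x00E9, 0x0065, 0x0301 }, /* e with acute */
    { 0x1E63, 0x0073, 0x0323 }, /* s with dot below */
    { 0x1E69, 0x1E63, 0x0307 }, /* s with dot below and dot above */
    { 0x212B, 0x00C5, 0 },      /* angstrom sign */
};

/* Canonical combining class for the marks used in the toy table. */
static unsigned toy_ccc(uint32_t c)
{
    switch (c) {
    case 0x0323: return 220;
    case 0x0301: case 0x0307: case 0x030A: return 230;
    default: return 0;
    }
}

/* Recursively decompose c, appending at out[n]. Returns the new
 * length, which may exceed cap; nothing is written past cap. */
static size_t emit(uint32_t c, uint32_t *out, size_t n, size_t cap)
{
    for (size_t i = 0; i < sizeof toy_decomp / sizeof *toy_decomp; i++) {
        if (toy_decomp[i].c != c) continue;
        n = emit(toy_decomp[i].a, out, n, cap);
        if (toy_decomp[i].b) n = emit(toy_decomp[i].b, out, n, cap);
        return n;
    }
    if (n < cap) out[n] = c;
    return n + 1;
}

/* Decompose, then put each run of combining marks into canonical
 * order. Returns the required length; output is valid only if the
 * return value is <= cap. */
size_t toy_nfd(const uint32_t *in, size_t len, uint32_t *out, size_t cap)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++) n = emit(in[i], out, n, cap);
    if (n > cap) return n;
    for (size_t i = 1; i < n; i++) {
        /* Stable insertion: move a mark left past marks of higher class. */
        for (size_t j = i; j > 0 && toy_ccc(out[j])
             && toy_ccc(out[j-1]) > toy_ccc(out[j]); j--) {
            uint32_t t = out[j];
            out[j] = out[j-1];
            out[j-1] = t;
        }
    }
    return n;
}

/* 1 if canonically equivalent under the toy data, 0 if not,
 * -1 if an input is too long for the fixed buffers. */
int toy_canon_equal(const uint32_t *a, size_t na,
                    const uint32_t *b, size_t nb)
{
    uint32_t x[32], y[32];
    size_t nx = toy_nfd(a, na, x, 32), ny = toy_nfd(b, nb, y, 32);
    if (nx > 32 || ny > 32) return -1;
    return nx == ny && !memcmp(x, y, nx * sizeof *x);
}
```

With this data, U+1E69 and the sequence U+0073 U+0307 U+0323 compare
as equivalent, since both reduce to U+0073 U+0323 U+0307. A collation
table built with the canonical closure would aim to give the same
result without running this step on each comparison.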

Separately (and not part of the locale overhaul project), IDN support
requires the capability to perform normalization into NFKC -- maybe
not for all of Unicode, but at least for the characters that could
appear in domain names. So in theory there is possibly some value to
trying to share the [de]composition tables and use them in both
directions.
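
As a rough illustration of using one table in both directions, the C
sketch below looks up the same array forward (decomposition) and in
reverse (composition). The names pair_map, shared_pairs,
shared_decompose and shared_compose are hypothetical, and the three
entries are a toy excerpt. A real composition step would also have to
skip singletons and the composition exclusions, and NFKC would need
the compatibility mappings, none of which is shown here.

```c
#include <stddef.h>
#include <stdint.h>

/* One record serves both directions (toy data, for illustration). */
struct pair_map { uint32_t composed, first, second; };

static const struct pair_map shared_pairs[] = {
    { 0x00C5, 0x0041, 0x030A }, /* A + ring above */
    { 0x00E9, 0x0065, 0x0301 }, /* e + acute */
    { 0x00F1, 0x006E, 0x0303 }, /* n + tilde */
};
#define NPAIRS (sizeof shared_pairs / sizeof *shared_pairs)

/* Forward lookup: composed character -> pair. Returns 1 on a hit. */
int shared_decompose(uint32_t c, uint32_t *first, uint32_t *second)
{
    for (size_t i = 0; i < NPAIRS; i++) {
        if (shared_pairs[i].composed != c) continue;
        *first = shared_pairs[i].first;
        *second = shared_pairs[i].second;
        return 1;
    }
    return 0;
}

/* Reverse lookup: pair -> composed character, or 0 if none. */
uint32_t shared_compose(uint32_t first, uint32_t second)
{
    for (size_t i = 0; i < NPAIRS; i++)
        if (shared_pairs[i].first == first
            && shared_pairs[i].second == second)
            return shared_pairs[i].composed;
    return 0;
}
```

The linear scans keep the sketch short. A compact real table would
presumably be sorted or indexed for at least one direction, and that
indexing is part of the size tradeoff.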

I know that for the very old version of Unicode supported in my
uuterm, the decomposition tables and code fit in under 8k.

I'm guessing the canonical closure for the collation data will be
a lot larger than that, even if Hangul could be special-cased and
elided. But depending on what level of collation capability we want
internal to libc, independent of having a locale definition loaded
(which would be mmapped and fully shared), this size might mainly be
in locale files on disk, as opposed to decomposition tables, which
would be linked into libc.
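
Hangul can be special-cased because the Unicode Standard defines the
decomposition of precomposed Hangul syllables arithmetically, so no
table entries are needed for them. A minimal C sketch follows. The
constants come from the standard's Hangul syllable decomposition
algorithm, while the function name hangul_decompose is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Constants from the Unicode Standard's Hangul decomposition algorithm. */
enum {
    S_BASE = 0xAC00, L_BASE = 0x1100, V_BASE = 0x1161, T_BASE = 0x11A7,
    L_COUNT = 19, V_COUNT = 21, T_COUNT = 28,
    N_COUNT = V_COUNT * T_COUNT, /* 588 */
    S_COUNT = L_COUNT * N_COUNT  /* 11172 */
};

/* Writes the full canonical decomposition of a precomposed Hangul
 * syllable to out and returns its length (2 or 3). Returns 0 if c is
 * not a precomposed Hangul syllable. */
size_t hangul_decompose(uint32_t c, uint32_t out[3])
{
    if (c < S_BASE || c >= S_BASE + S_COUNT) return 0;
    uint32_t s = c - S_BASE;
    out[0] = L_BASE + s / N_COUNT;             /* leading consonant */
    out[1] = V_BASE + (s % N_COUNT) / T_COUNT; /* vowel */
    uint32_t t = s % T_COUNT;
    if (!t) return 2;                          /* no trailing consonant */
    out[2] = T_BASE + t;                       /* trailing consonant */
    return 3;
}
```

For example, U+D55C decomposes to U+1112 U+1161 U+11AB. All 11172
syllables are covered by this arithmetic.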

I'll be trying to work out some quantitative data on the tradeoffs
here, but wanted to go ahead and put the topic out there, especially
since the IDN topic has come up on IRC again recently and coming up
with a good choice here might intersect with IDN stuff.

Rich
