Message-ID: <20250611025557.GP1827@brightrain.aerifal.cx>
Date: Tue, 10 Jun 2025 22:56:00 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Locale project/LC_COLLATE update: NFD!
First milestone in the collation part of the locale project: NFD is
working.
For some context, in order to be able to apply the Unicode Collation
Algorithm, it's necessary to be able to normalize the string so that
canonically equivalent strings don't order differently depending on
the order of combining marks or on whether precomposed forms are used.
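To make the problem concrete, here is a minimal sketch in C (not taken from the draft repo; the names `precomposed`, `decomposed` and `bytes_equal` are made up for illustration). It shows two canonically equivalent UTF-8 spellings of "é" that a plain byte comparison treats as different strings:

```c
#include <string.h>

/* Two canonically equivalent spellings of "é" in UTF-8.
 * These identifiers are hypothetical, for illustration only. */
const char precomposed[] = "\xc3\xa9";  /* U+00E9 */
const char decomposed[]  = "e\xcc\x81"; /* U+0065 U+0301 */

/* Plain byte comparison with no normalization: returns nonzero
 * only if the two strings are byte-for-byte identical. */
int bytes_equal(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}
```

Here `bytes_equal(precomposed, decomposed)` returns 0 even though both strings denote the same text, which is why collation needs a normalization step first.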
The most natural normalized form for applying collation tables is NFD
(decomposed). So, this is what I've been implementing, as an iterator
that, given a UTF-8 input string, reads off the NFD character sequence
one character at a time, in NFD order.
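As a rough illustration of what producing NFD involves, here is a toy sketch in C. It is not the code from the draft repo: it works on already-decoded code points rather than UTF-8, fills a buffer instead of iterating one character at a time, and uses a tiny hand-written table (`toy_decomp`, `toy_ccc`) where the real thing uses generated tables. All names are hypothetical. It shows only the two core steps: recursive canonical decomposition, then stable reordering of combining marks by canonical combining class.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy table; real tables are generated from Unicode data. */
struct decomp { uint32_t c, base, mark; };
static const struct decomp toy_decomp[] = {
    { 0x00C0, 0x0041, 0x0300 }, /* A with grave     -> A + U+0300 */
    { 0x00E9, 0x0065, 0x0301 }, /* e with acute     -> e + U+0301 */
    { 0x1EA0, 0x0041, 0x0323 }, /* A with dot below -> A + U+0323 */
};

/* Canonical combining class for the marks used here; 0 means starter. */
static int toy_ccc(uint32_t c)
{
    switch (c) {
    case 0x0300: case 0x0301: return 230; /* marks above */
    case 0x0323: return 220;              /* mark below */
    default: return 0;
    }
}

/* Append the full (recursive) decomposition of c at out[pos].
 * Returns the new position, which may exceed cap if out is too small. */
static size_t put_decomposed(uint32_t c, uint32_t *out, size_t pos, size_t cap)
{
    for (size_t i = 0; i < sizeof toy_decomp / sizeof *toy_decomp; i++) {
        if (toy_decomp[i].c == c) {
            pos = put_decomposed(toy_decomp[i].base, out, pos, cap);
            return put_decomposed(toy_decomp[i].mark, out, pos, cap);
        }
    }
    if (pos < cap) out[pos] = c;
    return pos + 1;
}

/* Write the toy NFD form of in[0..n-1] to out. Returns the length
 * needed; if that exceeds cap, the contents of out are incomplete. */
size_t toy_nfd(const uint32_t *in, size_t n, uint32_t *out, size_t cap)
{
    size_t len = 0;
    for (size_t i = 0; i < n; i++)
        len = put_decomposed(in[i], out, len, cap);
    if (len > cap) return len;

    /* Canonical ordering: stable insertion sort that moves a mark left
     * only past marks of strictly higher class, never past a starter. */
    for (size_t i = 1; i < len; i++) {
        uint32_t c = out[i];
        int cc = toy_ccc(c);
        size_t j = i;
        while (cc && j > 0 && toy_ccc(out[j-1]) > cc) {
            out[j] = out[j-1];
            j--;
        }
        out[j] = c;
    }
    return len;
}
```

With this sketch, the inputs U+0041 U+0301 U+0323 and U+1EA0 U+0301 both come out as U+0041 U+0323 U+0301, so a comparison applied afterwards no longer depends on how the input happened to be spelled.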
After writing the last steps to actually emit the tables as C arrays
today, wiring up the Unicode test vectors, and debugging a few small
issues, the full set of test vectors is passing. They are not actually
as comprehensive as I'd like; over the next few days I hope to go back
and add some broad combinatoric tests too.
Code and table sizes are well under my initial estimates, and not far
above the best-case ones I made along the way. Currently we're at:

    .text      1089   0
    .data         0   0
    .bss          0   0
    .rodata   14161   0

This covers the characters that participate in the transformation,
which are spread sporadically across the range U+00C0 to U+2FA1D.
The draft tablegen code and runtime code are in a git repo posted at
https://github.com/richfelker/musl-uca-draft. The runtime code is
basically in a state ready for integration when there is collation
code to use it. The tablegen is roughly the same level of polish as
things in musl-chartable-tools (where I intend for it to eventually
live) but it also contains a lot of disabled analytical code that I'll
likely strip out or refactor.
Generated tables are not in git, so I'm attaching them here in case
anyone wants to read and appreciate (or the opposite) the
pretty-printing without actually figuring out how to run anything.
Next steps: further validation & starting on a data format for the
collation tables themselves.
Rich
View attachment "decomp.h" of type "text/plain" (79151 bytes)