musl - Re: Locale project/LC_COLLATE update: NFD!

Message-ID: <20250611202211.GQ1827@brightrain.aerifal.cx>
Date: Wed, 11 Jun 2025 16:22:11 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Re: Locale project/LC_COLLATE update: NFD!

On Tue, Jun 10, 2025 at 10:56:00PM -0400, Rich Felker wrote:
> First milestone in the collation part of the locale project: NFD is
> working.
> 
> For some context, in order to be able to apply the Unicode Collation
> Algorithm, it's necessary to be able to normalize the string so that
> items don't order differently depending on the choice of ordering of
> combining marks or whether to use precomposed forms.
> 
> The most natural normalized form for applying collation tables is NFD
> (decomposed). So, this is what I've been implementing, as an iterator
> that, given a UTF-8 input string, reads off the NFD character sequence
> one character at a time, in NFD order.
> 
> After writing the last steps to actually emit the tables as C arrays
> today, wiring up the Unicode test vectors, and debugging a few small
> issues, the full set of test vectors is passing. They are not actually
> as comprehensive as I'd like; over the next few days I hope to go back
> and add some broad combinatoric tests too.
> 
> Code and table size is well under my initial estimates, and not far
> above the best-case ones I made along the way. Currently we're at:
> 
> .text              1089      0
> .data                 0      0
> .bss                  0      0
> .rodata           14161      0
> 
> This covers characters that participate in the transformation spread
> sporadically across the range U+00C0 to U+2FA1D.
> 
> The draft tablegen code and runtime code are in a git repo posted at
> https://github.com/richfelker/musl-uca-draft. The runtime code is
> basically in a state ready for integration when there is collation
> code to use it. The tablegen is roughly the same level of polish as
> things in musl-chartable-tools (where I intend for it to eventually
> live) but it also contains a lot of disabled analytical code that I'll
> likely strip out or refactor.
> 
> Generated tables are not in git so I'm attaching here in case anyone
> wants to read and appreciate (or the opposite) the pretty-printing
> without actually figuring out how to run anything.
> 
> Next steps: further validation & starting on a data format for the
> collation tables themselves.

Further validation continues to pass. The test I've added asserts
that, for each line in UnicodeData.txt,

- If field6 doesn't have a canonical decomposition,
  nfd(field1)==field1, i.e. character is unchanged by NFD.

- If field6 does have a canonical decomposition,
  nfd(field1)==nfd(field6).
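
The two assertions above can be sketched in a few lines of Python. This
is only an illustration of the property being checked, not the actual
test in the draft repo: it uses Python's unicodedata module as a
stand-in for both UnicodeData.txt and the nfd() under test, and the
helper names (nfd, canonical_decomposition, check_all) are made up for
the example. Hangul syllables are skipped because they decompose
algorithmically and have no per-character decomposition field in
UnicodeData.txt.

```python
import unicodedata


def nfd(s):
    # Stand-in for the implementation under test.
    return unicodedata.normalize("NFD", s)


def canonical_decomposition(ch):
    # Return the canonical decomposition mapping of ch as a string, or
    # None if ch has no mapping or only a compatibility mapping (those
    # are tagged like "<compat> ..." in the decomposition field).
    d = unicodedata.decomposition(ch)
    if not d or d.startswith("<"):
        return None
    return "".join(chr(int(cp, 16)) for cp in d.split())


def check_all():
    # Collect every code point that violates either assertion.
    failures = []
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates are not characters
        if 0xAC00 <= cp <= 0xD7A3:
            continue  # Hangul syllables: algorithmic, not table-driven
        ch = chr(cp)
        mapping = canonical_decomposition(ch)
        # No canonical mapping: NFD must leave the character unchanged.
        # Otherwise: NFD of the character must equal NFD of its mapping.
        expected = ch if mapping is None else nfd(mapping)
        if nfd(ch) != expected:
            failures.append(cp)
    return failures


if __name__ == "__main__":
    bad = check_all()
    print("failures:", [hex(cp) for cp in bad])
```

Comparing against nfd(field6) rather than field6 itself matters because
a mapping can itself contain characters that decompose further.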

This does not test anything fancy about reordering, just that table
generation and application of the table contents did not overlook any
characters that should have mappings, or wrongly apply mappings to any
that shouldn't have them.

I think this should give good overall coverage, though. The Unicode
tests were already designed to cover most(/all?) corner cases from the
standpoint of the algorithm, and the ones I've added supplement them
by covering anything that might be a corner case for the table
representation.

Rich