musl - Re: High-level binary collation data-format/processing

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <4cd9d7a45242442300276e4f5acc35441e9a4093.camel@postmarketos.org>
Date: Thu, 11 Dec 2025 00:07:06 +0100
From: Pablo Correa Gomez <pabloyoyoista@...tmarketos.org>
To: Rich Felker <dalias@...c.org>, musl@...ts.openwall.com
Subject: Re: High-level binary collation data-format/processing

Hi Rich,

In a previous email about "High-level binary format for new locale
files" you wrote:

> To demo simplified use of the table design and a potential specific
> binary format to use, I have a draft version of the include files to
> produce built-in C locale data described above. These need a little
> polishing still, so I'll include them in a follow-up to come soon.

I assume you were referring to this email. Is there any code that you
can share now? I think given the algorithmic complexity of the work
discussed, it might be a lot easier for more people to chime in if they
can have a look at some draft code, even if imperfect or incomplete.


El mie, 10-12-2025 a las 10:36 -0500, Rich Felker escribió:
> The following is an initial draft for how the runtime data
> representation and processing of collation weights should happen in
> musl's locale implementation with collation (LC_COLLATE, strcoll,
> strxfrm) functionality. This is part of the locale support overhaul
> project, funded by NLnet and the NGI Zero Core Fund.
> 
> 
> 
> The collation weight lookup process consists of taking the iterated
> NFD-order codepoints from the input string (see past work for how
> this
> is iterated) and using them as keys in the collation weights subtree
> of the memory-mapped binary locale file.
> 
> Ordinary leaves of the table are sequences of collation elements
> corresponding to the input, in CLDR FractionalUCA byte sequence form,
> concatenated within each weight level. For example, if a mapping
> yields a sequence of 2 collation elements A and B, with weights
> [A1,A2,A3] and [B1,B2,B3] respectively (where each of A1...B3 may be
> 0
> or more bytes), the concatenated form is P A1 B1 A2 B2 A3 B3 where P
> is a prefix representing the lengths of A1 B1, A2 B2, and A3 B3 to
> delimit them.
> 
> This "weight-level-major" packing is motivated by how the data is
> processed: on each pass at a given level, the sequence of bytes
> corresponding to the weights of all the output collating elements
> need
> to be emitted to the output buffer (strxfrm) or compared (strcoll).
> But it doesn't matter whether they came from 1 or 2 or 3 or N
> collating elements. Not only would an "element-major" form take more
> space in the table; it would also require costlier iteration to
> collect the needed bytes at strcoll/strxfrm time.
> 
> The form of the prefix P is not yet specified, but my intent is that
> it be a single byte for frequently-appearing small length
> combinations, with the ability to encode arbitrary lengths when
> needed. I'll follow up with an analysis of FractionalUCA.txt showing
> the frequency of each combination that appears in the base collation
> data.
> 
> 
> 
> Since UCA supports many-to-one and many-to-many mappings
> ("contractions") of codepoints to collation elements, not all table
> entries will be ordinary leaves. A key may map to a subtable indexed
> by next codepoint, with a default leaf for when the next codepoint
> does not participate in a contraction.
> 
> 
> 
> If we also want to support implicit weights and multiple types of
> implicit-weight assignment (mainly for legacy sorting of CJK, rather
> than Unihan radical-stroke order) without expanding them out to be
> explicit weights (which is always a possibility), there should be a
> mechanism for representing implicit weight rules as a leaf type, so
> that entire ranges can map to the same copy of the rule. It's not
> clear whether this is worth doing. It may make sense for this to be a
> separate table (only referenced after failure to find the codepont in
> the explicit mapping table). Design for this needs to be expanded
> upon.
> 
> 
> 
> CLDR collation algorithm extends UCA by offering prefix rules
> ("context-sensitive mappings") to reduce the number of contractions
> needed and theoretically improve performance. My expectation is that
> the performance benefits of using this will be relatively quite
> small,
> so my leaning is not to even try to encode/implement prefix matching
> but instead collapse any that appear in the source data to standard
> UCA-form contractions. But this is open to reconsideration if anyone
> has strong arguments that we should support first-class prefixes.
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.