musl - High-level binary collation data-format/processing

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <20251210153641.GA31827@brightrain.aerifal.cx>
Date: Wed, 10 Dec 2025 10:36:43 -0500
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: High-level binary collation data-format/processing

The following is an initial draft for how the runtime data
representation and processing of collation weights should happen in
musl's locale implementation with collation (LC_COLLATE, strcoll,
strxfrm) functionality. This is part of the locale support overhaul
project, funded by NLnet and the NGI Zero Core Fund.



The collation weight lookup process consists of taking the iterated
NFD-order codepoints from the input string (see past work for how this
is iterated) and using them as keys in the collation weights subtree
of the memory-mapped binary locale file.

Ordinary leaves of the table are sequences of collation elements
corresponding to the input, in CLDR FractionalUCA byte sequence form,
concatenated within each weight level. For example, if a mapping
yields a sequence of 2 collation elements A and B, with weights
[A1,A2,A3] and [B1,B2,B3] respectively (where each of A1...B3 may be 0
or more bytes), the concatenated form is P A1 B1 A2 B2 A3 B3 where P
is a prefix representing the lengths of A1 B1, A2 B2, and A3 B3 to
delimit them.

This "weight-level-major" packing is motivated by how the data is
processed: on each pass at a given level, the sequence of bytes
corresponding to the weights of all the output collating elements need
to be emitted to the output buffer (strxfrm) or compared (strcoll).
But it doesn't matter whether they came from 1 or 2 or 3 or N
collating elements. Not only would an "element-major" form take more
space in the table; it would also require costlier iteration to
collect the needed bytes at strcoll/strxfrm time.

The form of the prefix P is not yet specified, but my intent is that
it be a single byte for frequently-appearing small length
combinations, with the ability to encode arbitrary lengths when
needed. I'll follow up with an analysis of FractionalUCA.txt showing
the frequency of each combination that appears in the base collation
data.



Since UCA supports many-to-one and many-to-many mappings
("contractions") of codepoints to collation elements, not all table
entries will be ordinary leaves. A key may map to a subtable indexed
by next codepoint, with a default leaf for when the next codepoint
does not participate in a contraction.



If we also want to support implicit weights and multiple types of
implicit-weight assignment (mainly for legacy sorting of CJK, rather
than Unihan radical-stroke order) without expanding them out to be
explicit weights (which is always a possibility), there should be a
mechanism for representing implicit weight rules as a leaf type, so
that entire ranges can map to the same copy of the rule. It's not
clear whether this is worth doing. It may make sense for this to be a
separate table (only referenced after failure to find the codepont in
the explicit mapping table). Design for this needs to be expanded
upon.



CLDR collation algorithm extends UCA by offering prefix rules
("context-sensitive mappings") to reduce the number of contractions
needed and theoretically improve performance. My expectation is that
the performance benefits of using this will be relatively quite small,
so my leaning is not to even try to encode/implement prefix matching
but instead collapse any that appear in the source data to standard
UCA-form contractions. But this is open to reconsideration if anyone
has strong arguments that we should support first-class prefixes.
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.