musl - Re: Results of analysis of requirements for collation data representation



Message-ID: <20260524002758.GD27423@brightrain.aerifal.cx>
Date: Sat, 23 May 2026 20:27:59 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Re: Results of analysis of requirements for collation data
 representation

On Sat, May 23, 2026 at 06:34:30PM -0400, Rich Felker wrote:
> While the specifics of best representation of the collation weights
> data for runtime use, and how to generate such a representation from
> the CLDR root data and CLDR-format locale-specific tailorings, is
> something that will continue to evolve as the locale project
> continues, the actual requirements for the data representation are
> something that can be pinned down completely right now.
> 
> This is part of the locale support overhaul project, funded by NLnet
> and the NGI Zero Core Fund.
> 
> 
> 
> At a very high level, the collation data object is a greedy mapping
> from an input sequence of NFD codepoints, optionally contextual*, to
> sequences of collation elements.
> 
> [...]

A small follow-up on non-matches and implicit rules:

Earlier messages on this topic addressed a potential need to represent
explicitly the mappings to implicit rules.

For modern radical-stroke collation order, implicit weights are not
used for ideographic characters. The radical tables generate a very
large (presently 87k) number of explicit mappings which do not admit
any simple elision. This leaves implicit weights applying only to
unassigned or noncharacter codepoints.

For traditional ideographic implicit-weights order, the weights can
either be generated explicitly at localedef time like for
radical-stroke order, or we can have a table (either hardcoded in
libc, or as part of the locale) identifying which codepoint ranges the
special implicit weight rules apply to.
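
As a rough illustration of the table option, here is a minimal sketch
of a sorted range table with a binary-search lookup. The struct, the
function name, and the choice of ranges are all hypothetical; the three
ranges are just well-known Unicode block boundaries used as
placeholders, not the set the locale data would actually carry.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical range record: inclusive bounds, table sorted by lo. */
struct cp_range { uint32_t lo, hi; };

/* Placeholder contents only; a real table would be derived from the
 * collation data or generated at localedef time. */
static const struct cp_range ideo_ranges[] = {
	{ 0x3400, 0x4dbf },   /* CJK Unified Ideographs Extension A */
	{ 0x4e00, 0x9fff },   /* CJK Unified Ideographs */
	{ 0x20000, 0x2a6df }, /* CJK Unified Ideographs Extension B */
};

/* Return 1 if cp falls in a range that takes the ideographic
 * implicit weight rules, 0 otherwise. */
static int is_implicit_ideograph(uint32_t cp)
{
	size_t lo = 0, hi = sizeof ideo_ranges / sizeof ideo_ranges[0];
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		if (cp < ideo_ranges[mid].lo) hi = mid;
		else if (cp > ideo_ranges[mid].hi) lo = mid + 1;
		else return 1;
	}
	return 0;
}
```

Whether such a table lives in libc or in the locale file only changes
where ideo_ranges comes from; the lookup itself would be the same.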

In any case, codepoints which are not matched by any rule in the
collation mapping get processed on the fly for implicit weight
assignment.

This requires the collation weights data model to identify which
lead/prefix bytes are available for implicit weights to use.

For unassigned/non-ideographic codepoints, I think we only need a
single prefix, and the entire UTF-8 representation of the codepoint
can be the trailing part of the weight. That's not entirely optimal,
but optimizing the size of weights for unassigned codepoints seems
pointless. Since we already have a codepoint number, just dumping it
7 bits at a time into the low bits of bytes with the 0x80 bit set
might be a little denser and easier/faster. But either way the concept
is the same and this is easy.
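
A minimal sketch of the 7-bits-at-a-time variant, under some
assumptions: the prefix value 0xfe is made up for illustration (the
real prefix would come from the collation data), and the function
names are hypothetical. It always emits three trailing bytes (21 bits
covers everything up to 0x10ffff), so that bytewise comparison of the
weights agrees with codepoint order.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical prefix byte reserved for implicit weights. */
#define IMPLICIT_PREFIX 0xfe
#define IMPLICIT_LEN 4

/* Write the prefix followed by three bytes, each carrying 7 bits of
 * cp (most significant group first) with the 0x80 bit set.
 * Returns the number of bytes written. */
static size_t implicit_weight(uint32_t cp, unsigned char out[IMPLICIT_LEN])
{
	out[0] = IMPLICIT_PREFIX;
	out[1] = 0x80 | ((cp >> 14) & 0x7f);
	out[2] = 0x80 | ((cp >> 7) & 0x7f);
	out[3] = 0x80 | (cp & 0x7f);
	return IMPLICIT_LEN;
}

/* Compare two codepoints by their implicit weights; because the
 * encoding is fixed-length and big-endian, memcmp suffices. */
static int implicit_weight_cmp(uint32_t a, uint32_t b)
{
	unsigned char wa[IMPLICIT_LEN], wb[IMPLICIT_LEN];
	implicit_weight(a, wa);
	implicit_weight(b, wb);
	return memcmp(wa, wb, IMPLICIT_LEN);
}
```

With the UTF-8 variant the trailing part would instead be one to four
bytes, which also sorts in codepoint order bytewise; this version just
skips the UTF-8 encoding step.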

For ideographic implicit weights, the data model should pin down the
particular assignment of weights. This is because the FractionalUCA
root data contains rules which define an explicit mapping in terms of
a potentially-implicit ideographic codepoint. If the implicit
assignment is left as a runtime implementation detail, the definitions
would not necessarily align at runtime.

None of this particularly informs or constrains the high-level or
implementation details of the data model we use. It's just a
consideration that requires some representation before this is over.
