musl - Results of analysis of requirements for collation data representation

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <20260523223429.GC27423@brightrain.aerifal.cx>
Date: Sat, 23 May 2026 18:34:30 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Results of analysis of requirements for collation data representation

While the specifics of best representation of the collation weights
data for runtime use, and how to generate such a representation from
the CLDR root data and CLDR-format locale-specific tailorings, is
something that will continue to evolve as the locale project
continues, the actual requirements for the data representation are
something that can be pinned down completely right now.

This is part of the locale support overhaul project, funded by NLnet
and the NGI Zero Core Fund.



At a very high level, the collation data object is a greedy mapping
from an input sequence of NFD codepoints, optionally contextual*, to
sequences of collation elements.

At this abstract level, it doesn't even matter what collation elements
are/look-like. But in practice, each collation element consists of 3
weights, where each weight is a sequence of zero or more non-null
bytes. There are further constraints on the structure of these
weights, but they are not important at this level.

(*) The mapping may be context-dependent, in that it may assert a
prefix from previously-consumed codepoints. This is not a strictly
required feature, and it was not present in the original UCA; prefix
rules can be transformed to rules that don't use prefix matching. The
documented motivation for doing prefixes is performance. This seems
dubious, at least with the implementation I'm going with for
multi-character sequence matching, but the ability to use rules
written with prefixes, which appear in real CLDR data, without having
to transform them, seems worthwhile, so I am including it in the
model.

Per TR#35
https://unicode.org/reports/tr35/tr35-collation.html#Context_Sensitive_Mappings

prefixes are matched before contraction suffixes (any part of the
forward match beyond the first codepoint), and are greedy in the same
way the main contraction suffix matches are.

So. the keys in the mapping lookup can be represented as:

base/prefix-1/prefix-2/.../prefix-K/suffix+1/suffix+2/.../suffix+N

By requirements of the UCA, whenever a key of the form A/B/C/D exists,
keys of the form A/B/C, A/B, and A must also exist. These can be
represented as A/B/C/D, A/B/C/0, A/B/0, and A/0. so that a particular
level is never both final and non-final.

At each level of such a greedy mapping up until the first suffix, the
next component can be either a[nother] prefix codepoint or a first
suffix codepoint. This requires one additional bit of data in the
representation beyond the codepoint value. Since 0 is not a possible
codepoint value, using sign is an obvious choice: representing
prefixes as the negated codepoint value.









Use of collation data for non-strcoll/strxfrm use:

The regex engine also exposes collation with equivalence classes
[=ce=] and collating symbols [.ce.].

While it might seem there is a need to represent classes explicitly,
the regex implementation can evaluate equivalence by either (1) saving
the primary weight for the characters between the = signs at regcomp
time and looking up the primary weight of the input at regexec time,
and comparing them to evaluate the match, or (2) constructing strings
to pass to strcoll as a black box, which assess the primary
equivalence.

In particular, it is not necessary to be able to convert a
representative of the class into a set of characters. In fact the set
of matches in general will not be single characters but sequences
consisting of a base character and combining marks.

Likewise collating symbol matches can be asserted by doing the
equivalent of strxfrm on the characters between the .'s at regcomp
time, and likewise for the input at regexec time, and comparing them
byte-for-byte to evaluate the match.

Note: The TRE regex engine we use now does not support collation
elements or equivalence classes, and they are unlikely to be added in
the short-term future. What is important now however is that the data
model not preclude future support.
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.