Message-ID: <20260408175507.GA3975@brightrain.aerifal.cx>
Date: Wed, 8 Apr 2026 13:55:07 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Collation weight length frequencies

This is a followup to working out the details of the runtime format
for collation data designed around the Unicode CLDR FractionalUCA
(byte-based rather than legacy 16-bit-unit-based) format. This is part
of the locale support overhaul project, funded by NLnet and the NGI
Zero Core Fund.



Attached is an approximate table of frequencies of weight length
combinations (at levels 1, 2, and 3) for collation mappings from the
default order in FractionalUCA.txt. It's missing a small number of
rules defined in terms of other rules which my parser isn't
processing, and I think 2 prefix-context rules, but otherwise it
should be complete.

What it's showing is that the vast majority of rules are 1 byte in the
nonprimary levels and 3, 2, or 1 byte at the primary level. A
plurality of those that remain are the full ignorables (0 bytes at
all levels). So, from a standpoint of coding efficiency for the
mmappable runtime locale file, we only really need a compact
(single-byte) way to represent (1,1,1), (2,1,1), (3,1,1), and (0,0,0).
Anything else could be represented with explicit lengths at low cost
relative to the overall data size.

Rules that are more than one byte in nonprimary levels all arise from
mappings of one character or a sequence of characters to more than one
collation element. Patterns like (4,2,2) and (6,2,2) are actually
pairs of (2,1,1) and (3,1,1) collation elements. We don't need this
distinction for strcoll/strxfrm usage, but it may be useful to somehow
delimit them or flag them as multi-collation-element entries if we
will at some point have regex bracket support for collation elements.



So, where does this leave us?

One obvious choice we could make is just reserving 4 special header
byte values for (0,0,0), (1,1,1), (2,1,1), and (3,1,1), and explicitly
encoding anything else. This avoids the need for any significant
"unpacking" and keeps plenty of encoding space available, but it does
assume the distribution will remain similar after tailoring.
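As a rough sketch of this first option (all names and byte value
assignments below are placeholders, not a proposal):

```c
#include <assert.h>

/* Sketch of option 1: four reserved header byte values for the
 * dominant (primary,secondary,tertiary) length patterns; anything
 * else escapes to explicit length bytes. */
struct wlen { int p, s, t; };

enum {
	HDR_IGNORABLE = 0, /* (0,0,0) */
	HDR_P1,            /* (1,1,1) */
	HDR_P2,            /* (2,1,1) */
	HDR_P3,            /* (3,1,1) */
	HDR_EXPLICIT,      /* explicit p/s/t length bytes follow */
};

/* Returns 1 and fills *w if hdr is a compact form, else 0. */
static int lookup_compact(unsigned char hdr, struct wlen *w)
{
	static const struct wlen tab[] = {
		{0,0,0}, {1,1,1}, {2,1,1}, {3,1,1},
	};
	if (hdr >= HDR_EXPLICIT) return 0;
	*w = tab[hdr];
	return 1;
}
```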

Another obvious choice is using 2 bits for the length at each level,
or 3 at the primary level and 2 at each of the two remaining levels.
With 3 bits for primary, there would only be around 150 rules using
any longer explicit encoding, and it would be unlikely that tailorings
add significantly more.

This second option actually leaves a fair bit of encoding space free,
too. Aside from the space with the last 1-2 bits set, the space with
nonzero length at primary level but 0 length at any of the remaining
levels is not valid for weights. I doubt we need this, though.

Here is a draft of a possible encoding:

0 | ppp | ss | tt   - 3 bit primary, 2 bit secondary, 2 bit tertiary
1 | 1..127          - primary length 1..127, then 1 byte each for s/t
1 | 0               - vlc for each of p/s/t

The third case would never be used in practice but is just there for
completeness/support of gratuitously bad weight byte assignments. It
probably shouldn't even be implemented, just left as theoretical
encoding space.
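A decoder for the first two forms of this draft encoding could look
something like the following (function and struct names are
placeholders):

```c
#include <assert.h>

/* Sketch of a decoder for the draft header encoding above.
 * Returns the number of header bytes consumed, or -1 for the
 * 1|0 vlc form, which per the above is left as unimplemented
 * theoretical encoding space. */
struct wlen { int p, s, t; };

static int decode_header(const unsigned char *h, struct wlen *w)
{
	if (!(h[0] & 0x80)) {
		/* 0 | ppp | ss | tt */
		w->p = h[0]>>4 & 7;
		w->s = h[0]>>2 & 3;
		w->t = h[0] & 3;
		return 1;
	}
	if (!(h[0] & 0x7f)) return -1; /* 1 | 0: vlc for each of p/s/t */
	/* 1 | 1..127: primary length, then 1 byte each for s/t */
	w->p = h[0] & 0x7f;
	w->s = h[1];
	w->t = h[2];
	return 3;
}
```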

This doesn't account for storing any knowledge of whether the mapping
is a multi-collation-element one or any sort of delimiters for
multiple CEs. If we want to be able to store that, we could require
that the form with the high bit set (explicit lengths) be used
whenever it's multi-CE, drop the range to 1..63, and use the
additional recovered bit to flag that it's multi-CE (and optionally
store delimiter info in this case).
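The bit arithmetic for that variant would be along these lines
(names are placeholders):

```c
#include <assert.h>

/* Sketch of the multi-CE variant: with the high bit set (explicit
 * lengths), bit 6 becomes a multi-collation-element flag and the
 * primary length field shrinks to bits 0-5 (1..63). */
static int hdr_is_multi_ce(unsigned char b)
{
	return (b & 0xc0) == 0xc0;
}

static int hdr_explicit_plen(unsigned char b)
{
	return b & 0x3f; /* 1..63 */
}
```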


With any of these encodings, the size of the default collation table,
with implicit codepoint-order rules for ideographs, is looking to be
about 200k. I'd expect a little over double that with modern
stroke-radical order for ideographs.


One thing all of this would be throwing away is that there is very low
information content in the secondary and tertiary levels -- not just
the lengths but the actual weights. It might make more sense to
consider an encoding like the first option above that I didn't
elaborate on as much, but where the header byte encodes not just the
lengths but an index into a table of common values for
secondary/tertiary level. This would shave 50k off the default table
(more with stroke-radical order). Another way to accomplish this might
be using the (otherwise invalid) length-0 for secondary/tertiary to
compress common values. However, it might just be the case that the
relative size here is sufficiently low that we don't want to bother
with complexity.
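One reading of the length-0 idea, as a sketch (the weight value is
invented for illustration; a real table of common values would come
from the actual frequency data):

```c
#include <assert.h>

/* Sketch: a secondary length of 0 is otherwise invalid when the
 * primary is non-empty, so it could stand for the most common
 * secondary weight, stored nowhere in the entry. */
#define COMMON_SEC 0x05 /* placeholder for the most common weight */

static unsigned char secondary_weight(int len, const unsigned char *data)
{
	if (len == 0) return COMMON_SEC; /* compressed common value */
	return data[0]; /* len==1: weight stored inline */
}
```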


Rich

View attachment "freq.txt" of type "text/plain" (744 bytes)
