musl - Collation table data format, round 1

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [day] [month] [year] [list]
Message-ID: <20250612042036.GR1827@brightrain.aerifal.cx>
Date: Thu, 12 Jun 2025 00:20:37 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Collation table data format, round 1

Following up with research towards representing the collation data for
locales:

In general, it's likely that we want to be able to derive as much of
the locale data as possible from Unicode CLDR data. (CLDR = Common
Locale Data Repository)

Today, I found that the CLDR approach to collation both builds upon
and deviates some from the baseline UCA specs. The relevant document
is UTS#35, Part 5, Collation:

https://unicode.org/reports/tr35/tr35-collation.html

In particular, it offers both a structured mechanism to define
locale-specific tailorings and reorderings in a reasonable form for
locale implementors, and (this is the part that looks most exciting
from my side) provides an alternate framework for the root collation
data designed around compact variable-length byte-sequence-based
weights.

This is in contrast to the base UCA's DUCET table, which defines
fixed-width 16-bit weights at each of the 3+ levels -- 48+ bits of
mostly-redundant data per characer, which need further massaging to
work with strxfrm since its units are bytes not 16-bit code units, and
embedded zero bytes are not possible. It looks like this development
was the outcome of folks figuring out that the UCS-2 mindset was a
mistake. So, I think this gives a strong, well-precedented basis for a
data format to use.

Assuming we go with that, the on-disk/mapped format should look
something like this: a multi-level table (probably with level
breakdown defined in the data header, so that we don't hard-code
something that ends up not working well for particular locale
tailorings or future unicode directions) that maps:

- whole blocks subject to an implicit rule to the implicit rule

- characters that don't participate in a "contraction" (many-to-many
  map) to a set of byte sequences for each of the 3 (or n?) levels.
  probably represented as a single byte sequence with some efficient
  encoding delimiting the levels.

- characters that do participate in "contraction" to a next-level
  table in a sort of prefix-tree.

Note that the default weights (in UCS DUCET; I haven't reevalutated
this for the CLDR form but it should be the same) have very few
characters participating as the start of contractions. I count 77,
most of which only have 1-3 possibilities that can follow, but some
(especially Thai, which has vowel mark reordering; I think the other
cases are similar) have 45-48 possible [sequences of] characters that
follow.

Locale-specific tailorings may have a good deal more. Normally they're
isolated to the particular language(s) in that locale, but I would
like to explore building more multilingual-friendly collation tables
that cover "everything that doesn't conflict with your language"
rather than "only what's needed for your language". This is completely
outside the domain of code for the project, more a matter of what
recipes we end up using for the official locales. But it may inform
choices about what's costly in terms of on-disk locale size.

Rich
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.