musl - High-level binary format for new locale files

Message-ID: <20251120033043.GA8089@brightrain.aerifal.cx>
Date: Wed, 19 Nov 2025 22:30:43 -0500
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: High-level binary format for new locale files

The following is a draft that I've had pending for a while now,
regarding the binary format to be mmapped/processed at runtime, which
I put aside for a while to focus on what the localization-work-facing
source format would look like. It still needs some polishing and
fleshing out that will happen alongside out-of-tree implementation of
code for processing the locale data, but I think it's useful to have
the high-level design written up in public where it can be discussed
and used as reference in the future. This is part of the locale
support overhaul project, funded by NLnet and the NGI Zero Core Fund.




On a high level, the format is a multi-level table mapping "paths" of
integer keys to (usually textual) data blobs. First, some motivating
principles:

An important goal is that the built-in C locale data (langinfo,
strerror family, etc.) should be able to be represented in the exact
same form as an external locale file. This makes it so that we don't
need to have two versions of all of the lookup code, one for the
existing internal data and another for processing locale files. It
also means that we can get rid of some of the inefficient
linear-search logic for the built-in data now, making both
non-localized and localized performance better.

This kind of linear search elimination was already done for strerror
by Timo Teräs in commit 8343334d7b. I've been building on the same
concept (multiple inclusion of a header file defining the data, with
different context each time to expand to different parts of the table)
so that we don't need to "pre-compile" the built-in C locale data to
binary blobs like the ctype data, iconv data, nfd decomposition data,
etc. but can instead let the preprocessor do the work and keep the
data itself in editable source form.
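As an illustration only, the multiple-expansion technique can be sketched
in a single file. A list macro stands in here for the separately included
header, and every name (DEMO_STRINGS, demo_strs, demo_offs, demo_lookup)
and every string is hypothetical, not taken from the draft include files.

```c
#include <stddef.h>

/* Hypothetical data list. In the scheme described above this would be a
 * header that is #included several times; a list macro is swapped in so
 * the sketch fits in one file. */
#define DEMO_STRINGS(X) \
	X(0, "Sun") \
	X(1, "Mon") \
	X(2, "Tue")

/* Expansion 1: one exactly-sized char array per string, so that all the
 * strings sit back to back inside a single object. */
#define S(n, s) char str##n[sizeof(s)];
struct demo_strs { DEMO_STRINGS(S) };
#undef S

/* Expansion 2: the string contents. */
#define S(n, s) s,
static const struct demo_strs demo_strs = { DEMO_STRINGS(S) };
#undef S

/* Expansion 3: an offset table indexed directly by key, which is what
 * removes the need for a linear search. */
#define S(n, s) [n] = offsetof(struct demo_strs, str##n),
static const unsigned short demo_offs[] = { DEMO_STRINGS(S) };
#undef S

/* Returns the string for key, or a null pointer if key is out of range. */
static const char *demo_lookup(unsigned key)
{
	if (key >= sizeof demo_offs / sizeof demo_offs[0]) return 0;
	return (const char *)&demo_strs + demo_offs[key];
}
```

With this layout the data stays editable in one place, and the
preprocessor produces both the packed strings and the index into them.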

It's also desirable that the same data format used for locale strings
(langinfo, strerror, etc.) also work for collation elements. This
doesn't entirely preclude having a single flat integer namespace of
keys (for example you could OR a code into the upper bits of
codepoints to mean "collation element") but it does suggest against
it.

With the above in mind, the high-level design looks like this:

Lookups are to be performed according to a "path" of integer keys,
where each path component may traverse one or more table levels. For
example, if top-level index 1 is langinfo strings, 1/0x20000 leads to
the ABDAY_1 string. In general there is a property,

	lookup(root,mmm/nnn) = lookup(lookup(root,mmm),nnn)

so that something (like collation) needing to perform lots of lookups
can just find its starting point in the tree once, and perform each
subsequent lookup relative to that.
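The composition property can be sketched with a toy in-memory tree. This
is only an illustration under assumptions: struct node, lookup,
lookup_path and the demo data are hypothetical, the keys are small
made-up values, and nothing here reflects the actual binary layout.

```c
#include <stddef.h>

/* Hypothetical in-memory node. The real format is a binary blob meant to
 * be mmapped; this struct only models the lookup semantics. */
struct node {
	unsigned base;                    /* smallest key that has a slot */
	unsigned count;                   /* number of slots */
	const struct node *const *child;  /* count slots; null if absent */
	const char *data;                 /* non-null only at leaves */
};

/* One step: follow a single key from n, or return null if absent. */
static const struct node *lookup(const struct node *n, unsigned key)
{
	if (!n || !n->child) return 0;
	if (key < n->base || key - n->base >= n->count) return 0;
	return n->child[key - n->base];
}

/* A path is just repeated single steps, which is why
 * lookup(root, m/n) equals lookup(lookup(root, m), n). */
static const struct node *lookup_path(const struct node *n,
	const unsigned *path, size_t len)
{
	for (size_t i = 0; i < len; i++) n = lookup(n, path[i]);
	return n;
}

/* Made-up demo data: root key 1 leads to a subtable with keys 0 and 1. */
static const struct node leaf_a = { 0, 0, 0, "Sun" };
static const struct node leaf_b = { 0, 0, 0, "Mon" };
static const struct node *const sub_slots[] = { &leaf_a, &leaf_b };
static const struct node demo_sub = { 0, 2, sub_slots, 0 };
static const struct node *const root_slots[] = { &demo_sub };
static const struct node demo_root = { 1, 1, root_slots, 0 };
static const unsigned demo_path[] = { 1, 1 };
```

A caller doing many lookups would resolve lookup(&demo_root, 1) once and
then call lookup on the returned node for each subsequent key.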

Because key spread may be sparse (for example, the langinfo keys have
a category starting at bit 16 and an index within the category
starting at bit 0), individual "path components" can be represented as
multiple levels in the table structure, with base/shift defined by the
file. For example, the langinfo subtable will typically define a first
level with base 0x20000 (there are no category-0 or -1 items) and shift
16, and leaf levels for each category.
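A minimal sketch of how a base/shift pair might split one key into a
first-level slot and a remainder for the leaf level follows. It assumes
the base is expressed in key units, so that category 2 begins at 0x20000
as ABDAY_1 does; struct level, level_slot, level_rest and langinfo_top
are hypothetical names, not part of the format.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical level descriptor; the file would supply these values. */
struct level {
	uint32_t base;   /* subtracted from the key before shifting */
	unsigned shift;  /* bit position where this level's index starts */
};

/* Slot index at this level, or -1 if the key lies below the base. */
static long long level_slot(const struct level *lv, uint32_t key)
{
	assert(lv->shift < 32);
	if (key < lv->base) return -1;
	return (long long)((key - lv->base) >> lv->shift);
}

/* Low bits left over for the next level down. Only meaningful when
 * level_slot() did not return -1 for the same key. */
static uint32_t level_rest(const struct level *lv, uint32_t key)
{
	assert(lv->shift < 32);
	return (key - lv->base) & ((UINT32_C(1) << lv->shift) - 1);
}

/* Assumed first level for the langinfo subtable: categories 0 and 1 are
 * skipped by the base, and the category number starts at bit 16. */
static const struct level langinfo_top = { 0x20000, 16 };
```

Under these assumptions key 0x20000 maps to slot 0 with remainder 0, and
a key such as 0x50001 maps to slot 3 with remainder 1, so the single path
component is resolved in two table levels.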

While it seems like we could just skip the ability of the data to
define its own table levels like this, and instead treat something
sparse like langinfo keys as 2 path components (using the above
example, 1/2/0 for ABDAY_1), the above goal of being able to use the same
data structure and table traversal code for collation elements means
we already want the flexibility to represent sparse tables. The
specifics of collation element representation in the table structure
will be fleshed out later and may inform tuning. I am in the process
of munging base collation data to measure how large resulting tables
will be and what adjustments if any might be needed to represent the
data and do so efficiently.



To demo simplified use of the table design and a potential specific
binary format to use, I have a draft version of the include files to
produce built-in C locale data described above. These need a little
polishing still, so I'll include them in a follow-up to come soon.