musl - Re: High-level binary format for new locale files

Message-ID: <725f12372826f3cacdf8791d1f3ea2f3b67a236a.camel@postmarketos.org>
Date: Wed, 10 Dec 2025 23:56:28 +0100
From: Pablo Correa Gomez <pabloyoyoista@...tmarketos.org>
To: Rich Felker <dalias@...c.org>, musl@...ts.openwall.com
Subject: Re: High-level binary format for new locale files

Thanks a lot for this work! It is very neat that there is already a
precedent for it in the strerror optimization. To me it means that the
pattern is more widely applicable, and therefore easier to understand
for people looking at the code.

I wonder if we should archive the rationale laid out here somewhere
else than in the mailing list archive. This might be very relevant for
anybody looking at the code for the first time to understand what is
going on and why. Generally newcomers also don't have an easy or fast
way to search the archives.

Best,
Pablo Correa Gomez

On Wed, 19-11-2025 at 22:30 -0500, Rich Felker wrote:
> The following is a draft that I've had pending for a while now,
> regarding the binary format to be mmapped/processed at runtime, which
> I put aside for a while to focus on what the localization-work-facing
> source format would look like. It still needs some polishing and
> fleshing out that will happen alongside out-of-tree implementation of
> code for processing the locale data, but I think it's useful to have
> the high-level design written up in public where it can be discussed
> and used as reference in the future. This is part of the locale
> support overhaul project, funded by NLnet and the NGI Zero Core Fund.
> 
> 
> 
> 
> On a high level, the format is a multi-level table mapping "paths" of
> integer keys to (usually textual) data blobs. First, some motivating
> principles:
> 
> An important goal is that the built-in C locale data (langinfo,
> strerror family, etc.) should be able to be represented in the exact
> same form as an external locale file. This makes it so that we don't
> need to have two versions of all of the lookup code, one for the
> existing internal data and another for processing locale files. It
> also means that we can get rid of some of the inefficient
> linear-search logic for the built-in data now, making both
> non-localized and localized performance better.
> 
> This kind of linear search elimination was already done for strerror
> by Timo Teräs in commit 8343334d7b. I've been building on the same
> concept (multiple inclusion of a header file defining the data, with
> different context each time to expand to different parts of the table)
> so that we don't need to "pre-compile" the built-in C locale data to
> binary blobs like the ctype data, iconv data, nfd decomposition data,
> etc. but can instead let the preprocessor do the work and keep the
> data itself in editable source form.
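
A minimal sketch of the multiple-expansion idea described above, for
readers new to the pattern. It is not the code from the commit
mentioned: to stay in one file it uses a list macro instead of a
separately included header, and all names (MSG_LIST, msgstr, msg_idx,
msg_lookup) as well as the sample strings are hypothetical.

```c
#include <stddef.h>

/* Hypothetical sample data. In a real tree this list would live in its
 * own header, included several times under different macro definitions. */
#define MSG_LIST(X) \
	X(0, "No error") \
	X(1, "Operation not permitted") \
	X(2, "No such file or directory")

/* Three "contexts" that expand the same list in different ways. */
#define MEMBER(n, s) char str##n[sizeof(s)];
#define INIT(n, s)   s,
#define INDEX(n, s)  [n] = offsetof(struct msgstr_t, str##n),

/* Expansions 1 and 2: one struct member per string, then its contents.
 * Together they form a single blob with compile-time-known offsets. */
static const struct msgstr_t {
	MSG_LIST(MEMBER)
} msgstr = {
	MSG_LIST(INIT)
};

/* Expansion 3: index table mapping each code to its offset in the blob. */
static const unsigned short msg_idx[] = {
	MSG_LIST(INDEX)
};

/* Constant-time lookup, no linear search; unknown codes map to entry 0. */
const char *msg_lookup(unsigned n)
{
	if (n >= sizeof msg_idx / sizeof *msg_idx) n = 0;
	return (const char *)&msgstr + msg_idx[n];
}
```

The data stays in editable source form (the list), while the
preprocessor and compiler produce both the string blob and the offset
table from it.
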
> 
> It's also desirable that the same data format used for locale strings
> (langinfo, strerror, etc.) also work for collation elements. This
> doesn't entirely preclude having a single flat integer namespace of
> keys (for example you could OR a code into the upper bits of
> codepoints to mean "collation element") but it does suggest against
> it.
> 
> With the above in mind, the high-level design looks like this:
> 
> Lookups are to be performed according to a "path" of integer keys,
> where each path component may traverse one or more table levels. For
> example, if top-level index 1 is langinfo strings, 1/0x20000 leads to
> the ABDAY_1 string. In general there is a property,
> 
>  lookup(root,mmm/nnn) = lookup(lookup(root,mmm),nnn)
> 
> so that something (like collation) needing to perform lots of lookups
> can just find its starting point in the tree once, and perform each
> subsequent lookup relative to that.
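
The composition property above can be illustrated with a small
in-memory tree. This is only a sketch of the shape of the lookup, not
the proposed binary format: the node layout, the function names (step,
lookup) and the tiny demo table are all assumptions made for the
example.

```c
#include <stddef.h>

/* Hypothetical node: either a leaf blob or a dense table of children.
 * The real format would be an mmapped table, not pointers. */
struct node {
	const char *blob;                 /* non-NULL for leaves */
	const struct node *const *child;  /* child table for interior nodes */
	size_t nchild;
};

/* Resolve one path component relative to n; NULL if it is absent. */
const struct node *step(const struct node *n, size_t key)
{
	if (!n || !n->child || key >= n->nchild) return NULL;
	return n->child[key];
}

/* Resolve a whole path by repeated single steps. Because each step
 * depends only on the current node, resolving a prefix first and the
 * rest later gives the same result as resolving the full path. */
const struct node *lookup(const struct node *root,
                          const size_t *path, size_t len)
{
	while (root && len--) root = step(root, *path++);
	return root;
}

/* Hypothetical demo data: index 1 under the root holds two strings. */
static const struct node sun = { "Sun", NULL, 0 };
static const struct node mon = { "Mon", NULL, 0 };
static const struct node *const days[] = { &sun, &mon };
static const struct node abday = { NULL, days, 2 };
static const struct node *const top[] = { NULL, &abday };
const struct node demo_root = { NULL, top, 2 };
```

A caller doing many lookups under the same prefix would call lookup()
once for the prefix, keep the returned node, and pass it as the root
of each later call.
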
> 
> Because key spread may be sparse (for example, the langinfo keys have
> a category starting at bit 16 and an index within the category
> starting at bit 0), individual "path components" can be represented as
> multiple levels in the table structure, with base/shift defined by the
> file. For example, the langinfo subtable will typically define a first
> level with base 0x20000 (there are no category-0 or -1 items) and shift
> 16, and leaf levels for each category.
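
One way to read the base/shift description is sketched below. It
assumes the first-level base is 0x20000 (matching the ABDAY_1 key
given earlier, with the category in bits 16 and up). The struct
fields, the function name level_split and the slot count of 4 are
hypothetical, not part of the proposal.

```c
#include <stdint.h>

/* Hypothetical level descriptor; names and layout are assumptions. */
struct level {
	uint32_t base;   /* smallest key covered by this level */
	unsigned shift;  /* number of low bits left for the next level (< 32) */
	uint32_t count;  /* number of slots at this level */
};

/* Split key into a slot index at this level and the remaining low bits
 * to be resolved by the next level. Returns 1 on success, 0 if the key
 * falls outside the range this level covers. */
int level_split(const struct level *lv, uint32_t key,
                uint32_t *slot, uint32_t *rest)
{
	if (key < lv->base) return 0;
	uint32_t off = key - lv->base;
	uint32_t s = off >> lv->shift;
	if (s >= lv->count) return 0;
	*slot = s;
	*rest = off & ((UINT32_C(1) << lv->shift) - 1);
	return 1;
}

/* Assumed first level for the langinfo subtable: categories start at 2,
 * so key 0x20000 lands in slot 0. The count of 4 is made up. */
const struct level langinfo_top = { 0x20000, 16, 4 };
```

With these assumed values, key 0x20000 splits into slot 0 with
remainder 0, while a key below the base is rejected without needing
empty slots for the missing categories.
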
> 
> While it seems like we could just skip the ability of the data to
> define its own table levels like this, and instead treat something
> sparse like langinfo keys as 2 path components (using the above
> example, 1/2/0 for ABDAY_1), the above goal of being able to use the
> same data structure and table traversal code for collation elements
> means we already want the flexibility to represent sparse tables. The
> specifics of collation element representation in the table structure
> will be fleshed out later and may inform tuning. I am in the process
> of munging base collation data to measure how large resulting tables
> will be and what adjustments if any might be needed to represent the
> data and do so efficiently.
> 
> 
> 
> To demo simplified use of the table design and a potential specific
> binary format to use, I have a draft version of the include files to
> produce built-in C locale data described above. These need a little
> polishing still, so I'll include them in a follow-up to come soon.