Message-ID: <20251120033043.GA8089@brightrain.aerifal.cx>
Date: Wed, 19 Nov 2025 22:30:43 -0500
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: High-level binary format for new locale files

The following is a draft that I've had pending for a while now, regarding the binary format to be mmapped/processed at runtime, which I put aside for a while to focus on what the localization-work-facing source format would look like. It still needs some polishing and fleshing out that will happen alongside out-of-tree implementation of code for processing the locale data, but I think it's useful to have the high-level design written up in public where it can be discussed and used as reference in the future.

This is part of the locale support overhaul project, funded by NLnet and the NGI Zero Core Fund.

On a high level, the format is a multi-level table mapping "paths" of integer keys to (usually textual) data blobs.

First, some motivating principles:

An important goal is that the built-in C locale data (langinfo, strerror family, etc.) should be able to be represented in the exact same form as an external locale file. This makes it so that we don't need to have two versions of all of the lookup code, one for the existing internal data and another for processing locale files. It also means that we can get rid of some of the inefficient linear-search logic for the built-in data now, making both non-localized and localized performance better.

This kind of linear search elimination was already done for strerror by Timo Teräs in commit 8343334d7b. I've been building on the same concept (multiple inclusion of a header file defining the data, with different context each time to expand to different parts of the table) so that we don't need to "pre-compile" the built-in C locale data to binary blobs like the ctype data, iconv data, nfd decomposition data, etc. but can instead let the preprocessor do the work and keep the data itself in editable source form.

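
As a rough illustration of the multiple-expansion idea (not code from the draft or from musl), the sketch below expands one list of strings three times: once to lay out a struct, once to fill it, and once to build an offset index. To stay in a single file it uses a list macro in place of a separately included header. All names here (DEMO_STRINGS, demo_blob, demo_offsets, demo_lookup) and the three sample strings are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the data header. In the scheme described
 * above, this list would live in its own file and be #included several
 * times, with the entry macro redefined before each inclusion. */
#define DEMO_STRINGS(X) \
	X(0, "Sun") \
	X(1, "Mon") \
	X(2, "Tue")

/* Expansion 1: one exactly-sized char array per string, so that all
 * strings end up inside a single object. */
struct demo_blob {
#define X(n, s) char str##n[sizeof(s)];
	DEMO_STRINGS(X)
#undef X
};

/* Expansion 2: the string contents, in the same order. */
static const struct demo_blob demo_blob = {
#define X(n, s) s,
	DEMO_STRINGS(X)
#undef X
};

/* The index below stores offsets as unsigned short. */
static_assert(sizeof(struct demo_blob) <= 0xffff,
	"offsets must fit in unsigned short");

/* Expansion 3: offset of each string within the blob. */
static const unsigned short demo_offsets[] = {
#define X(n, s) offsetof(struct demo_blob, str##n),
	DEMO_STRINGS(X)
#undef X
};

#define DEMO_COUNT (sizeof demo_offsets / sizeof demo_offsets[0])

/* Constant-time lookup by index; returns a null pointer if out of range. */
static const char *demo_lookup(size_t i)
{
	if (i >= DEMO_COUNT) return 0;
	return (const char *)&demo_blob + demo_offsets[i];
}
```

With this layout, demo_lookup(1) points at "Mon" without scanning past the preceding strings, and the data stays in an editable list in the source.
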
It's also desirable that the same data format used for locale strings (langinfo, strerror, etc.) also work for collation elements. This doesn't entirely preclude having a single flat integer namespace of keys (for example you could OR a code onto the upper bits of codepoints to mean "collation element") but it does suggest against it.

With the above in mind, the high-level design looks like this:

Lookups are to be performed according to a "path" of integer keys, where each path component may traverse one or more table levels. For example, if top-level index 1 is langinfo strings, 1/0x20000 leads to the ABDAY_1 string. In general there is a property,

    lookup(root,mmm/nnn) = lookup(lookup(root,mmm),nnn)

so that something (like collation) needing to perform lots of lookups can just find its starting point in the tree once, and perform each subsequent lookup relative to that.

Because key spread may be sparse (for example, the langinfo keys have a category starting at bit 16 and an index within the category starting at bit 0), individual "path components" can be represented as multiple levels in the table structure, with base/shift defined by the file. For example, the langinfo subtable will typically define a first level with base 0x20000 (there are no category-0 or -1 items) and shift 16, and leaf levels for each category.

While it seems like we could just skip the ability of the data to define its own table levels like this, and instead treat something sparse like langinfo keys as 2 path components (using the above example, 1/2/0 for ABDAY_1), the above goal of being able to use the same data structure and table traversal code for collation elements means we already want the flexibility to represent sparse tables.

The specifics of collation element representation in the table structure will be fleshed out later and may inform tuning.

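
To make the path and base/shift description concrete, here is a minimal sketch of one way such a traversal could work. It is an assumption-laden illustration, not the draft format: it uses in-memory structs with pointers where a real mmapped file would presumably use offsets, and the struct layout, the names (struct node, step, lookup, demo_*), and the rule "shift 0 ends a path component" are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical node: either a table level (count > 0) or a data leaf. */
struct node {
	uint32_t base;   /* smallest key this level covers */
	unsigned shift;  /* low key bits passed on to the next level */
	uint32_t count;  /* number of slots; 0 for a data leaf */
	const struct node *const *slots; /* children, null if absent */
	const char *data; /* payload of a data leaf */
};

/* Consume one path component, descending through as many levels as the
 * data defines. Assumed convention: a level with shift 0 ends the
 * component. Returns a null pointer if the key is not present. */
static const struct node *step(const struct node *n, uint32_t key)
{
	for (;;) {
		if (!n || !n->count || key < n->base) return 0;
		assert(n->shift < 32);
		key -= n->base;
		uint32_t i = key >> n->shift;
		if (i >= n->count) return 0;
		const struct node *child = n->slots[i];
		if (!n->shift) return child;
		key &= ((uint32_t)1 << n->shift) - 1;
		n = child;
	}
}

/* Follow a whole path; each step starts where the previous one ended,
 * which is what gives lookup(root,m/n) == lookup(lookup(root,m),n). */
static const struct node *lookup(const struct node *root,
	const uint32_t *path, size_t len)
{
	for (size_t i = 0; i < len && root; i++)
		root = step(root, path[i]);
	return root;
}

/* Demo data: root slot 1 -> langinfo; langinfo first level has base
 * 0x20000 and shift 16; its only slot is the leaf level for category 2. */
static const struct node demo_abday1 = { .data = "Sun" };
static const struct node demo_abday2 = { .data = "Mon" };
static const struct node *const demo_time_slots[] = { &demo_abday1, &demo_abday2 };
static const struct node demo_time = { .count = 2, .slots = demo_time_slots };
static const struct node *const demo_langinfo_slots[] = { &demo_time };
static const struct node demo_langinfo = {
	.base = 0x20000, .shift = 16, .count = 1, .slots = demo_langinfo_slots
};
static const struct node *const demo_root_slots[] = { 0, &demo_langinfo };
static const struct node demo_root = { .count = 2, .slots = demo_root_slots };
```

With this demo data, looking up the path 1/0x20000 from demo_root reaches the "Sun" leaf, and a caller that first resolves component 1 can then look up 0x20000 or 0x20001 relative to that result without starting over from the root.
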
I am in the process of munging base collation data to measure how large resulting tables will be and what adjustments if any might be needed to represent the data and do so efficiently.

To demo simplified use of the table design and a potential specific binary format to use, I have a draft version of the include files to produce built-in C locale data described above. These need a little polishing still, so I'll include them in a follow-up to come soon.