musl - musl multi-level table format for binary locale images

Message-ID: <20260509032227.GA23680@brightrain.aerifal.cx>
Date: Fri, 8 May 2026 23:22:28 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: musl multi-level table format for binary locale images

The concepts here have been presented before; what follows is an
informal spec of the actual mappable image format that has emerged
from earlier design proposals and discussion and from implementation
of draft tooling.

The multi-level tables used here are in some sense a data-driven
generalization of the multi-level tables used elsewhere in musl for
character data, with a header at each level defining the range covered
and the bits examined at that level.

Some of the details here are not yet matched by the draft tooling, and
all may still be subject to further change until integration and
release.

This is part of the locale support overhaul project, funded by NLnet
and the NGI Zero Core Fund.





FILE FORMAT


Main file header:

TBD. This mainly needs some arbitrary magic to make the files easily
identifiable. Any further extensibility needed can be covered by
having a top-level table index reserved for storing additional fields.


Table structure:

	be32 start;
	u8 shift;
	u8 scale;
	be16 size;
	union {
		u8 offsets8[size];
		be16 offsets16[size/2];
		be32 offsets32[size/4];
	};
	u8 data[];

This represents a table of offsets for a range of integer key values
beginning at start. Keys are processed as unsigned 32-bit values, but
can represent a signed range crossing 0 as needed. Offsets may be
encoded as unsigned 8-, 16-, or 32-bit values.

Offsets are relative to the byte before the start of data[], so that
an offset value of 1 indicates the first byte. An offset of 0
indicates that the key is not defined.

If shift is nonzero, the offset obtained at index (key-start)>>shift
in the offsets array leads to a subtable of the same form that will
take the remainder (key-start)&((1<<shift)-1) as its input; the
process continues iteratively this way until reaching a subtable where
shift is zero.

If shift is zero, the offset obtained at index key-start leads to the
value associated with the key.
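
As a rough illustration of the lookup procedure described above, here
is a minimal sketch in C. It is not the musl implementation. The names
rd_be and tbl_lookup are hypothetical. The text above does not say how
the width of the offsets is selected or what the scale field does, so
the sketch takes the width as a caller-supplied parameter, applies it
at every level, and ignores scale; all of that is assumption. It reads
size as the byte size of the offsets array, as the union declaration
suggests, and does no bounds checking against the image size.

```c
#include <stddef.h>
#include <stdint.h>

/* Read an n-byte big-endian unsigned value (n is 1, 2, or 4). */
static uint32_t rd_be(const unsigned char *p, int n)
{
	uint32_t v = 0;
	while (n--) v = v<<8 | *p++;
	return v;
}

/* Hypothetical lookup: returns a pointer to the value for key, or a
 * null pointer if the key is not defined. width is the size in bytes
 * of each offsets[] entry (1, 2, or 4); taking it as a parameter and
 * using the same width at every level are assumptions of this sketch. */
static const unsigned char *tbl_lookup(const unsigned char *t,
	uint32_t key, int width)
{
	for (;;) {
		uint32_t start = rd_be(t, 4);
		unsigned shift = t[4];
		/* t[5] is the scale field; this sketch does not use it. */
		uint32_t size = rd_be(t+6, 2);  /* offsets[] size in bytes */
		const unsigned char *offsets = t + 8;
		const unsigned char *data = offsets + size;

		if (shift > 31) return 0;       /* avoid undefined shifts */

		/* Unsigned wraparound lets start act as a negative value. */
		uint32_t rel = key - start;
		uint32_t idx = rel >> shift;
		if (idx >= size / width) return 0;

		uint32_t off = rd_be(offsets + (size_t)idx*width, width);
		if (!off) return 0;             /* key not defined */

		/* An offset of 1 designates data[0]. */
		const unsigned char *target = data + (off - 1);
		if (!shift) return target;      /* leaf level: the value */

		/* Descend, passing the low bits on as the new key. */
		key = rel & ((UINT32_C(1) << shift) - 1);
		t = target;
	}
}
```

A key below start wraps around to a large unsigned value and is
rejected by the index check, so no separate lower-bound test is needed.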


Path hierarchy structure:

The associated data for a key may be another table, creating a tree
structure addressed by tuples of integer keys. As all offsets are
interpreted as non-negative, loops are impossible, but elision of
common subtrees is possible by storing the subtree beyond the last
point that needs to refer to it.

In general there is no type information or other metadata stored
alongside data in the tables. It is up to the process performing the
lookup to know whether the value obtained is to be interpreted as
another table, as a string, or as some other binary data object.
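
To make the tree addressing concrete, here is a hedged sketch of
walking a tuple of keys from the root. The names leaf_lookup and
path_lookup are hypothetical. For brevity it handles only tables with
shift == 0 and 8-bit offsets, which is a simplification and not part
of the format; the caller is assumed to know that every value except
the last one on the path is itself a table.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified single-level lookup (shift == 0, 8-bit offsets only),
 * just enough to demonstrate path traversal. */
static const unsigned char *leaf_lookup(const unsigned char *t,
	uint32_t key)
{
	uint32_t start = (uint32_t)t[0]<<24 | (uint32_t)t[1]<<16
		| (uint32_t)t[2]<<8 | t[3];
	uint32_t size = (uint32_t)t[6]<<8 | t[7];
	uint32_t idx = key - start;
	if (idx >= size) return 0;
	unsigned off = t[8+idx];
	/* Offsets count from the byte before data[]; 0 means undefined. */
	return off ? t + 8 + size + (off - 1) : 0;
}

/* Hypothetical path walk: treats the value found for each key but
 * the last as another table. Returns a null pointer if any key
 * along the path is undefined. */
static const unsigned char *path_lookup(const unsigned char *root,
	const uint32_t *path, size_t n)
{
	const unsigned char *p = root;
	for (size_t i = 0; i < n && p; i++)
		p = leaf_lookup(p, path[i]);
	return p;
}
```

Two keys in a parent table may carry the same offset, in which case
both paths resolve to the same stored subtree.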



USAGE FOR LOCALE DATA


Layout principles:

In order both to avoid locking down implementation requirements/costs
and to facilitate representing the constant C locale data easily in
source-level C structures, the hierarchical path structure is used
extensively. This leaves it an open implementation choice whether to
always perform lookups from the root or cache pointers to deeper
levels where they might be a bottleneck (as in collation processing).

The top-level keys are broad categories of data. Under some, such as
langinfo and errors, there are multiple subcategories. The
implementation may choose to store pointers to a category like errors,
or even to a subcategory like strerror messages.

In some cases, such as the langinfo strings, the subcategory
(corresponding to LC_* locale category) may appear redundant with the
high bits already in the integer key value; a 2-level table with
shift==16 at the first level would yield a very similar structure
putting the keys directly under the top-level langinfo table. However,
doing it that way would preclude storing a pointer to the subtable for
a particular category (such as LC_TIME). Since, either way, two levels
need to be traversed, it's preferable to use an explicit path level.
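
As an illustration, a lookup path for an nl_langinfo item could be
derived as in the following sketch. The function name langinfo_path is
hypothetical. It assumes that the subcategory key is the numeric LC_*
value found in the high 16 bits of the item, and that the full item
value is used as the final key, which matches the start = 131072
example for LC_TIME further below.

```c
#include <stdint.h>

/* Hypothetical helper: fill in the three-key path for a langinfo
 * item. The top-level key 2 is taken from the list of top-level keys
 * in this document; using item>>16 as the subcategory key is an
 * assumption of this sketch. */
static void langinfo_path(uint32_t item, uint32_t path[3])
{
	path[0] = 2;           /* top-level key: langinfo */
	path[1] = item >> 16;  /* LC_* category from the high bits */
	path[2] = item;        /* full item; the subtable's start
	                        * field absorbs the high bits */
}
```

An implementation that caches a pointer to the table for one category
would only need the last key for each lookup.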


Top-level keys:

[These do not match the draft tooling; I will be updating it to match,
along with adding binary output capability.]

header = 0
localeconv = 1
langinfo = 2
collation = 3
errors = 4
messages = 5


Path layout:

localeconv/-1: binary data for the char fields of struct lconv, in the
order they appear in the ISO C specification and in musl locale.h.
These are to be copied at locale load-time, replacing \xff with \x7f
on archs where char is signed.
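
A minimal sketch of that load-time copy follows, using the
hypothetical name copy_lconv_chars. Mapping \xff to CHAR_MAX yields
\x7f where char is signed and leaves the byte unchanged where char is
unsigned.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical helper: copy n char-field bytes from the mapped image
 * into a struct lconv under construction, mapping the stored 0xff to
 * CHAR_MAX (0x7f on signed-char archs, 0xff otherwise). */
static void copy_lconv_chars(char *dst, const unsigned char *src,
	size_t n)
{
	for (size_t i = 0; i < n; i++)
		dst[i] = src[i] == 0xff ? CHAR_MAX : (char)src[i];
}
```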

localeconv/0..9: string data for the first 10 fields of struct lconv,
likewise in the order they appear in the specification and in musl.
Items 2 and 7 consist of a pair of strings separated by a null
terminator byte, the first one for use on archs with signed char and
the second for use on archs with unsigned char.
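
Selecting the right member of such a pair could look like the
following sketch; the function name pick_for_char_signedness is
hypothetical.

```c
#include <limits.h>
#include <string.h>

/* Hypothetical helper: given a pointer to a pair of null-terminated
 * strings stored back to back, return the first on archs where char
 * is signed and the second on archs where char is unsigned. */
static const char *pick_for_char_signedness(const char *pair)
{
	if (CHAR_MIN < 0) return pair;
	return pair + strlen(pair) + 1;
}
```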

langinfo/LC_TIME/$item: nl_langinfo strings for LC_TIME

langinfo/LC_MESSAGES/$item: nl_langinfo strings for LC_MESSAGES

errors/0/$err: strerror strings

errors/1/$err: hstrerror strings

errors/2/$err: gai_strerror strings

errors/3/$err: regerror strings

collation/*: TBD


Examples of data encoding:

langinfo/LC_TIME:

	start = 131072
	shift = 0
	scale = 0
	size = .....
	offsets8[] = {
		1, 5, 9, 13, 17, 21, 25, 29, 36, ...
	}
	data[] = "Sun\0Mon\0Tue\0Wed\0Thu\0Fri\0Sat\0Sunday\0..."

errors/0:

	start = -1
	shift = 0
	scale = 0
	size = .....
	offsets16[] = {
		1, 15, 36, 60, ...
	}
	data[] = "Unknown error\0"
	         "No error information\0"
	         "Operation not permitted\0"
	         "No such file or directory\0"
	         ...
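
The offset values in these examples follow directly from the string
lengths: each string begins one byte past the terminator of the
previous one, and offsets count from 1. As a sanity check, a sketch
like the following (the function name string_offsets is hypothetical
and this is not the draft tooling) reproduces them from the
concatenated data.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: compute 1-based offsets for a run of
 * null-terminated strings occupying len bytes at data (the last byte
 * must be a terminator). Stores at most max offsets in out and
 * returns the number stored. */
static size_t string_offsets(const char *data, size_t len,
	uint32_t *out, size_t max)
{
	size_t n = 0, pos = 0;
	while (pos < len && n < max) {
		out[n++] = (uint32_t)pos + 1;   /* offset 1 is data[0] */
		pos += strlen(data + pos) + 1;  /* skip string and its \0 */
	}
	return n;
}
```

For the four error strings shown above this yields 1, 15, 36, 60.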