musl - Re: Selecting locale source format

Message-ID: <CAJgzZooiidR18yF3jY0098_ugguiwB59dT2NXs4MYg8tfAF1BQ@mail.gmail.com>
Date: Wed, 17 Sep 2025 11:43:46 -0400
From: enh <enh@...gle.com>
To: musl@...ts.openwall.com
Subject: Re: Selecting locale source format

On Tue, Sep 16, 2025 at 9:14 PM Rich Felker <dalias@...c.org> wrote:
>
> I have a proposed binary format for new locale files that I'm in the
> process of writing up, but Pablo brought it to my attention that,
> while binary format (ABI) is what's important to have down and stable
> at the time we integrate into musl, pinning down the source format is
> what's important/blocking for collaboration with localization folks.
>
> I have two candidate formats in the works right now for this:
>
>
>
> Option 1: subset+extension of POSIX localedef format.
>
> The basis for this format is described in
> https://pubs.opengroup.org/onlinepubs/9799919799/basedefs/V1_chap07.html
>
> If we go this way, it would be a "subset" because (1) some parts are
> not relevant, like LC_CTYPE, which does not vary by locale,

note that that's not true in general: case mapping of 'i' differs in
turkish/azeri locales, for example. (unless you meant that you plan on
using the unicode cldr data directly here.)

see the "Language-Sensitive Mappings" section of SpecialCasing.txt (in
the unicode character database) for all the special cases.
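
to make that concrete, here's a rough python sketch of the turkic
tailoring. the function names are made up for illustration and this
isn't code from musl or any other libc; only the four mappings are from
SpecialCasing.txt, and it ignores the other languages listed there.

```python
# Sketch of the tr/az case tailoring from SpecialCasing.txt.
# to_upper/to_lower are hypothetical helpers, not a real libc or ICU API.
TURKIC = {"tr", "az"}

def to_upper(s, lang):
    if lang in TURKIC:
        # i (U+0069) uppercases to dotted capital I (U+0130)
        s = s.replace("i", "\u0130")
    # default mapping handles the rest, including dotless i (U+0131) -> I
    return s.upper()

def to_lower(s, lang):
    if lang in TURKIC:
        s = s.replace("I\u0307", "i")   # I + combining dot above -> i
        s = s.replace("\u0130", "i")    # dotted capital I -> i
        s = s.replace("I", "\u0131")    # I -> dotless i
    return s.lower()

print(to_upper("title", "en"))  # TITLE
print(to_upper("title", "tr"))  # TİTLE
print(to_lower("TITLE", "tr"))  # tıtle
```

so the same towupper()/towlower() call has to give different answers
depending on the locale, which is why the mapping can't be entirely
locale-independent.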

> (2) some
> parts will necessarily be represented in different ways, like
> collation where we're using UCA rather than the POSIX form, and (3)
> the format just has a lot of gratuitous cruft like symbolic character
> names. It will also necessarily be extended because POSIX localedef
> has no way to represent translated error strings etc. - keys for them
> have to be added.
>
> Going this route would have the source data in a fairly compact and
> "well-known" (to certain audiences) form, but requires that the
> tooling to produce binary locale files be aware of how these fields
> translate to the data model for the binary form.
>
> A sample (should be roughly correct C/POSIX locale) is attached for
> reference.
>
>
>
>
> Option 2: human-readable/text representation of the binary form
>
> Describing this requires a basic intro to the binary form, which is a
> multi-level hierarchical table mapping a path of integer key values to
> a data blob. In text we can represent keys with symbolic constants,
> but they're just a way of writing the underlying numbers. For example
> the path strerror/0 leads to the "No error information" text,
> strerror/EACCES leads to the "Permission denied" text, etc. Here
> "strerror" just represents a number for the first-level path component
> where strerror strings are stored, subindexed by (the arch/generic
> versions of) the errno codes.
>
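
for anyone trying to picture the data model, a rough python sketch of
"path of integer keys to a data blob" might look like the following. the
number chosen for "strerror", the helper names, and the use of the host's
errno values are all assumptions for illustration; the actual key
assignments aren't given above.

```python
import errno

# Hypothetical number for the first-level "strerror" path component.
STRERROR = 1
SYMBOLS = {"strerror": STRERROR}

# Nested tables keyed by integers, with blobs at the leaves.
table = {
    STRERROR: {
        0: b"No error information",
        errno.EACCES: b"Permission denied",
    },
}

def parse_path(text):
    """Turn a symbolic path like 'strerror/EACCES' into integer keys."""
    keys = []
    for part in text.split("/"):
        if part.isdigit():
            keys.append(int(part))
        elif part in SYMBOLS:
            keys.append(SYMBOLS[part])
        else:
            # Assumption: errno names resolve to the host's values here.
            keys.append(getattr(errno, part))
    return tuple(keys)

def lookup(table, path):
    """Walk the nested tables one integer key at a time."""
    node = table
    for key in path:
        node = node[key]
    return node

print(lookup(table, parse_path("strerror/0")))
print(lookup(table, parse_path("strerror/EACCES")))
```

the symbolic names only exist in the text form; everything past
parse_path() sees plain integers.
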
> Going this route mostly avoids the need for smarts in the tooling, and
> "has more flexibility" to encode things. But this also potentially
> makes the encoding seem more arbitrary to localization folks.
>
> Like in option 1, a sample (some hybrid between C/POSIX and a
> hypothetical US-English locale, whipped up quick by hand as an
> example) of one way this format could look is attached for reference.
> An obvious variant that might be friendlier/more-familiar to folks
> working with the data would be representing the same in json (which is
> easy).
>
>
>
>
> My leaning is towards option 1.
>