musl - Re: Selecting locale source format

Message-Id: <7B7869C8-5B94-4157-96DE-2B09B12BC64A@Wilcox-Tech.com>
Date: Tue, 16 Sep 2025 20:23:09 -0500
From: "A. Wilcox" <AWilcox@...cox-Tech.com>
To: musl@...ts.openwall.com
Subject: Re: Selecting locale source format

On Sep 16, 2025, at 20:14, Rich Felker <dalias@...c.org> wrote:
> 
> I have a proposed binary format for new locale files that I'm in the
> process of writing up, but Pablo brought it to my attention that,
> while binary format (ABI) is what's important to have down and stable
> at the time we integrate into musl, pinning down the source format is
> what's important/blocking for collaboration with localization folks.
> 
> I have two candidate formats in the works right now for this:
> 
> 
> 
> Option 1: subset+extension of POSIX localedef format.
> 
> The basis for this format is described in
> https://pubs.opengroup.org/onlinepubs/9799919799/basedefs/V1_chap07.html
> 
> If we go this way, it would be a "subset" because (1) some parts are
> not relevant, like LC_CTYPE, which does not vary by locale, (2) some
> parts will necessarily be represented in different ways, like
> collation where we're using UCA rather than the POSIX form, and (3)
> the format just has a lot of gratuitous cruft like symbolic character
> names. It will also necessarily be extended because POSIX localedef
> has no way to represent translated error strings etc. - keys for them
> have to be added.
> 
> Going this route would have the source data in a fairly compact and
> "well-known" (to certain audiences) form, but requires that the
> tooling to produce binary locale files be aware of how these fields
> translate to the data model for the binary form.
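
For illustration only, here is a minimal sketch of the kind of reader
such tooling would start from. It handles a small subset of the POSIX
localedef syntax: category header, keyword lines with ';'-separated
quoted operands, "END <category>" trailers, '#' comments, and
backslash line continuation. The function names and the SAMPLE text
are assumptions made up for this sketch, not part of the attached
sample or of any musl tooling, and escaped quotes and symbolic
character names are deliberately not handled.

```python
def split_operands(operands):
    """Split 'a";"b' style operand lists on ';' outside double quotes."""
    if not operands:
        return []
    values, buf, in_quotes = [], "", False
    for ch in operands:
        if ch == '"':
            in_quotes = not in_quotes      # quote marks are dropped
        elif ch == ";" and not in_quotes:
            values.append(buf)
            buf = ""
        elif ch.isspace() and not in_quotes:
            continue                       # ignore padding between operands
        else:
            buf += ch
    values.append(buf)
    return values


def parse_localedef(text, comment_char="#", escape_char="\\"):
    """Parse a localedef-like subset into {category: {keyword: [operands]}}."""
    categories = {}
    current = None   # name of the category being read, if any
    pending = ""     # text carried over from a continued line
    for raw in text.splitlines():
        line = pending + raw.strip()
        pending = ""
        if not line or line.startswith(comment_char):
            continue
        if line.endswith(escape_char):     # continuation: join with next line
            pending = line[:-1]
            continue
        if current is None:
            if line.startswith("LC_"):
                current = line
                categories[current] = {}
            continue
        if line == "END " + current:
            current = None
            continue
        parts = line.split(None, 1)
        operands = parts[1] if len(parts) > 1 else ""
        categories[current][parts[0]] = split_operands(operands)
    return categories


# Hypothetical input, loosely modelled on C/POSIX locale values.
SAMPLE = '''
# illustrative fragment only
LC_TIME
abday "Sun";"Mon";"Tue";"Wed";"Thu";"Fri";"Sat"
d_fmt "%m/%d/%y"
END LC_TIME

LC_MESSAGES
yesexpr "^[yY]"
noexpr "^[nN]"
END LC_MESSAGES
'''

parsed = parse_localedef(SAMPLE)
print(parsed["LC_TIME"]["abday"])
```

Any extension keys (for example for translated error strings) would
show up here as additional keywords, which the tooling would then
have to map onto the binary data model.
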
> 
> A sample (should be roughly correct C/POSIX locale) is attached for
> reference.
> 
> 
> 
> 
> Option 2: human-readable/text representation of the binary form
> 
> Describing this requires a basic intro to the binary form, which is a
> multi-level hierarchical table mapping a path of integer key values to
> a data blob. In text we can represent keys with symbolic constants,
> but they're just a way of writing the underlying numbers. For example
> the path strerror/0 leads to the "No error information" text,
> strerror/EACCES leads to the "Permission denied" text, etc. Here
> "strerror" just represents a number for the first-level path component
> where strerror strings are stored, subindexed by (the arch/generic
> versions of) the errno codes.
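
A rough sketch of the lookup described above, modelling the table as
nested dictionaries keyed by integers. The numeric ID chosen for
"strerror", the SYMBOLS map, and the lookup() helper are all
hypothetical, and Python's host errno values stand in for the
arch/generic ones. The actual binary layout is not specified in this
message, so this only illustrates the path-to-blob idea.

```python
import errno

# Hypothetical number for the first-level "strerror" path component.
STRERROR = 1

# Symbolic names are only a way of writing the underlying integers.
SYMBOLS = {"strerror": STRERROR, "EACCES": errno.EACCES}

# Multi-level table: integer key path -> data blob.
TABLE = {
    STRERROR: {
        0: b"No error information",
        errno.EACCES: b"Permission denied",
    },
}


def resolve(component):
    """Turn a symbolic or decimal path component into its integer key."""
    return SYMBOLS[component] if component in SYMBOLS else int(component)


def lookup(table, path):
    """Walk a '/'-separated key path; return the blob, or None if absent."""
    node = table
    for component in path.split("/"):
        if not isinstance(node, dict):
            return None                    # path continues past a blob
        node = node.get(resolve(component))
        if node is None:
            return None
    return node if isinstance(node, bytes) else None


print(lookup(TABLE, "strerror/0"))
print(lookup(TABLE, "strerror/EACCES"))
```
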
> 
> Going this route mostly avoids the need for smarts in the tooling, and
> "has more flexibility" to encode things. But this also potentially
> makes the encoding seem more arbitrary to localization folks.
> 
> Like in option 1, a sample (some hybrid between C/POSIX and a
> hypothetical US-English locale, whipped up quick by hand as an
> example) of one way this format could look is attached for reference.
> An obvious variant that might be friendlier/more-familiar to folks
> working with the data would be representing the same in JSON (which is
> easy).
> 
> 
> 
> 
> My leaning is towards option 1.
> 
> <sample_posix_localedef.txt><sample_binary_as_text.txt>

Hi Rich,

Thanks for continuing the locale work - very happy to see it
progressing!

I definitely prefer option 1 as well.  This will allow an easy
migration path for people using other Unix or Unix-like systems
(Solaris, AIX, glibc Linux) where localedef is also used.  It also
means there is a large corpus of existing files we can use, both
for testing the tooling and for initial drafts when porting musl
to other locales.

I think it is reasonable to extend the file to handle translations
for days of the week/months.  Is there a reason the existing system
of gettext(3) can’t be used for strerror_l?

Best,
-Anna