Message-ID: <8d8572b08bde29d12ac2e454d50d488cf5aaab42.camel@postmarketos.org>
Date: Mon, 02 Mar 2026 14:22:18 +0100
From: Pablo Correa Gomez <pabloyoyoista@...tmarketos.org>
To: Rich Felker <dalias@...c.org>, "A. Wilcox" <AWilcox@...cox-Tech.com>
Cc: musl@...ts.openwall.com
Subject: Re: Selecting locale source format

On Tue, 16-09-2025 at 21:36 -0400, Rich Felker wrote:
> On Tue, Sep 16, 2025 at 08:23:09PM -0500, A. Wilcox wrote:
> > On Sep 16, 2025, at 20:14, Rich Felker <dalias@...c.org> wrote:
> > > 
> > > I have a proposed binary format for new locale files that I'm in the
> > > process of writing up, but Pablo brought it to my attention that,
> > > while binary format (ABI) is what's important to have down and stable
> > > at the time we integrate into musl, pinning down the source format is
> > > what's important/blocking for collaboration with localization folks.
> > > 
> > > I have two candidate formats in the works right now for this:
> > > 
> > > 
> > > 
> > > Option 1: subset+extension of POSIX localedef format.
> > > 
> > > The basis for this format is described in
> > > https://pubs.opengroup.org/onlinepubs/9799919799/basedefs/V1_chap07.html
> > > 
> > > If we go this way, it would be a "subset" because (1) some parts are
> > > not relevant, like LC_CTYPE, which does not vary by locale, (2) some
> > > parts will necessarily be represented in different ways, like
> > > collation where we're using UCA rather than the POSIX form, and (3)
> > > the format just has a lot of gratuitous cruft like symbolic character
> > > names. It will also necessarily be extended because POSIX localedef
> > > has no way to represent translated error strings etc. - keys for them
> > > have to be added.
> > > 
> > > Going this route would have the source data in a fairly compact and
> > > "well-known" (to certain audiences) form, but requires that the
> > > tooling to produce binary locale files be aware of how these fields
> > > translate to the data model for the binary form.
> > > 
> > > A sample (should be roughly correct C/POSIX locale) is attached for
> > > reference.
> > > 
> > > 
> > > 
> > > 
> > > Option 2: human-readable/text representation of the binary form
> > > 
> > > Describing this requires a basic intro to the binary form, which is a
> > > multi-level hierarchical table mapping a path of integer key values to
> > > a data blob. In text we can represent keys with symbolic constants,
> > > but they're just a way of writing the underlying numbers. For example
> > > the path strerror/0 leads to the "No error information" text,
> > > strerror/EACCES leads to the "Permission denied" text, etc. Here
> > > "strerror" just represents a number for the first-level path component
> > > where strerror strings are stored, subindexed by (the arch/generic
> > > versions of) the errno codes.
> > > 
> > > Going this route mostly avoids the need for smarts in the tooling, and
> > > "has more flexibility" to encode things. But this also potentially
> > > makes the encoding seem more arbitrary to localization folks.
> > > 
> > > Like in option 1, a sample (some hybrid between C/POSIX and a
> > > hypothetical US-English locale, whipped up quick by hand as an
> > > example) of one way this format could look is attached for reference.
> > > An obvious variant that might be friendlier/more-familiar to folks
> > > working with the data would be representing the same in json (which is
> > > easy).
> > > 
> > > 
> > > 
> > > 
> > > My leaning is towards option 1.
> > > 
> > > <sample_posix_localedef.txt><sample_binary_as_text.txt>
> > 
> > Hi Rich,
> > 
> > Thanks for continuing the locale work - very happy to see it
> > progressing!
> > 
> > I definitely prefer option 1 as well.  This will allow an easy
> > migration path for people using other Unix or Unix-like systems
> > (Solaris, AIX, glibc Linux) where localedef is also used.  It also
> > means there is a large corpus of existing files we can use,
> > both for testing the tooling and for initial drafts at porting musl
> > to other locales.
> > 
> > I think it is reasonable to extend the file to handle translations
> > for days of the week/months.  Is there a reason the existing system
> > of gettext(3) can’t be used for strerror_l?
> 
> The fundamental problem with the current system we have is gettext
> keying off of the English string. That was fatal for [AB]MON_5 "May",
> but it's also less than ideal for error messages. For example it's
> plausible we might use the same text for an errno code as for a regex
> or getaddrinfo error message, and then the keys would clash. And of
> course if the messages are changed at all, translation files get
> invalidated.
> 
> I'll go over the proposed new binary format more when I finish writing
> it up, but on top of avoiding all these issues, it lets us get rid of
> all the repetitive linear-search-multistring operations in musl and
> replace them with efficient O(1) lookup regardless of whether a locale
> file or internal messages in libc are being used.
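
To make the key clash described above concrete, here is a minimal sketch.
It is not musl or gettext code: the catalog layout, the helper name
add_translation, and the placeholder translation strings are all assumptions
made up for illustration. It only shows why keying on the English text loses
one of two entries that happen to share that text, while keying on an
identifier keeps both.

```python
# Illustrative sketch only; names and placeholder strings are hypothetical.

def add_translation(catalog, key, translation):
    """Store a translation under the given key, overwriting any earlier entry."""
    catalog[key] = translation

# Catalog keyed by the English string, as gettext does with its msgid.
english_keyed = {}
# The abbreviated and the full name of the fifth month are both "May" in
# English, so the second entry silently replaces the first.
add_translation(english_keyed, "May", "<abbreviated form of May>")  # meant for ABMON_5
add_translation(english_keyed, "May", "<full form of May>")         # meant for MON_5

# Catalog keyed by an identifier instead of the English text.
id_keyed = {}
add_translation(id_keyed, ("ABMON", 5), "<abbreviated form of May>")
add_translation(id_keyed, ("MON", 5), "<full form of May>")

print(len(english_keyed), len(id_keyed))
```

The same effect would apply if an errno message and a regex or getaddrinfo
message shared identical English wording.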

A. Wilcox, now that there is a proposed binary format at
https://www.openwall.com/lists/musl/2026/02/25/1, do you have any further
thoughts on this?
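
For readers catching up on the thread, the path-keyed lookup described under
option 2 in the quoted message can be sketched roughly as follows. This is
not the proposed binary format or any musl tooling: the number assigned to
"strerror", the nested-dict layout, and the helper names resolve and lookup
are assumptions for illustration, and the errno values come from the host
Python rather than musl's arch/generic tables.

```python
import errno

# Hypothetical numbering of first-level path components; the real values
# are not given in this thread.
SYMBOLS = {"strerror": 1}

# Multi-level table: each level maps an integer key to a sub-table, and the
# last level maps an integer key to a data blob.
TABLE = {
    SYMBOLS["strerror"]: {
        0: b"No error information",
        errno.EACCES: b"Permission denied",
    },
}

def resolve(component):
    """Turn one textual path component into its underlying integer key."""
    if component.isdigit():
        return int(component)
    if component in SYMBOLS:
        return SYMBOLS[component]
    # Otherwise treat it as an errno name such as EACCES (host value).
    return getattr(errno, component)

def lookup(path):
    """Walk the table along a slash-separated path and return the blob."""
    node = TABLE
    for component in path.split("/"):
        node = node[resolve(component)]
    return node

print(lookup("strerror/0"))
print(lookup("strerror/EACCES"))
```

The symbolic names exist only in the text form; the lookup itself sees
integers at every level.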
