Message-ID: <66590520b9fef551b6fa0f3b0b6beed579b413e4.camel@postmarketos.org>
Date: Fri, 19 Sep 2025 16:06:12 +0200
From: Pablo Correa Gomez <pabloyoyoista@...tmarketos.org>
To: Rich Felker <dalias@...c.org>, "A. Wilcox" <AWilcox@...cox-Tech.com>
Cc: musl@...ts.openwall.com
Subject: Re: Selecting locale source format

On Tue, 16-09-2025 at 21:36 -0400, Rich Felker wrote:
> On Tue, Sep 16, 2025 at 08:23:09PM -0500, A. Wilcox wrote:
> > On Sep 16, 2025, at 20:14, Rich Felker <dalias@...c.org> wrote:
> > > 
> > > I have a proposed binary format for new locale files that I'm in the
> > > process of writing up, but Pablo brought it to my attention that,
> > > while binary format (ABI) is what's important to have down and stable
> > > at the time we integrate into musl, pinning down the source format is
> > > what's important/blocking for collaboration with localization folks.
> > > 
> > > I have two candidate formats in the works right now for this:
> > > 
> > > 
> > > 
> > > Option 1: subset+extension of POSIX localedef format.
> > > 
> > > The basis for this format is described in
> > > https://pubs.opengroup.org/onlinepubs/9799919799/basedefs/V1_chap07.html
> > > 
> > > If we go this way, it would be a "subset" because (1) some parts are
> > > not relevant, like LC_CTYPE, which does not vary by locale, (2) some
> > > parts will necessarily be represented in different ways, like
> > > collation where we're using UCA rather than the POSIX form, and (3)
> > > the format just has a lot of gratuitous cruft like symbolic character
> > > names. It will also necessarily be extended because POSIX localedef
> > > has no way to represent translated error strings etc. - keys for them
> > > have to be added.
> > > 
> > > Going this route would have the source data in a fairly compact and
> > > "well-known" (to certain audiences) form, but requires that the
> > > tooling to produce binary locale files be aware of how these fields
> > > translate to the data model for the binary form.
> > > 
> > > A sample (should be roughly correct C/POSIX locale) is attached for
> > > reference.
> > > 
> > > 
> > > 
> > > 
> > > Option 2: human-readable/text representation of the binary form
> > > 
> > > Describing this requires a basic intro to the binary form, which is a
> > > multi-level hierarchical table mapping a path of integer key values to
> > > a data blob. In text we can represent keys with symbolic constants,
> > > but they're just a way of writing the underlying numbers. For example
> > > the path strerror/0 leads to the "No error information" text,
> > > strerror/EACCES leads to the "Permission denied" text, etc. Here
> > > "strerror" just represents a number for the first-level path component
> > > where strerror strings are stored, subindexed by (the arch/generic
> > > versions of) the errno codes.
> > > 
> > > Going this route mostly avoids the need for smarts in the tooling, and
> > > "has more flexibility" to encode things. But this also potentially
> > > makes the encoding seem more arbitrary to localization folks.
> > > 
> > > Like in option 1, a sample (some hybrid between C/POSIX and a
> > > hypothetical US-English locale, whipped up quick by hand as an
> > > example) of one way this format could look is attached for reference.
> > > An obvious variant that might be friendlier/more-familiar to folks
> > > working with the data would be representing the same in json (which is
> > > easy).
> > > 
> > > 
> > > 
> > > 
> > > My leaning is towards option 1.
> > > 
> > > <sample_posix_localedef.txt><sample_binary_as_text.txt>
> > 
> > Hi Rich,
> > 
> > Thanks for continuing the locale work - very happy to see it
> > progressing!
> > 
> > I definitely prefer option 1 as well.  This will allow an easy
> > migration path for people using other Unix or Unix-like systems
> > (Solaris, AIX, glibc Linux) where localedef is also used.  It also
> > means there is a large corpus of existing files we can use,
> > both for testing the tooling and for initial drafts at porting musl
> > to other locales.
> > 
> > I think it is reasonable to extend the file to handle translations
> > for days of the week/months.  Is there a reason the existing system
> > of gettext(3) can’t be used for strerror_l?
> 
> The fundamental problem with the current system we have is gettext
> keying off of the English string. That was fatal for [AB]MON_5 "May",
> but it's also less than ideal for error messages. For example it's
> plausible we might use the same text for an errno code as for a regex
> or getaddrinfo error message, and then the keys would clash. And of
> course if the messages are changed at all, translation files get
> invalidated.
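
To make the key clash concrete, here is a small sketch in Python. It is
illustrative only: the dictionaries, the function names and the Spanish
strings are assumptions made up for the example, not musl or gettext data
structures. It contrasts a catalog keyed by the English text with one keyed
by a (category, index) path.

```python
# Hypothetical illustration; not musl's or gettext's real data structures.

# Catalog keyed by the English string (the msgid).
by_msgid = {}
by_msgid["May"] = "mayo"   # intended for the full month name (MON_5)
by_msgid["May"] = "may."   # abbreviated month (ABMON_5) overwrites the entry

# Catalog keyed by a path that does not depend on the English text.
by_path = {
    ("MON", 5): "mayo",
    ("ABMON", 5): "may.",
}

def lookup_msgid(english):
    """Return the translation for an English string, or the string itself."""
    return by_msgid.get(english, english)

def lookup_path(category, index, fallback):
    """Return the translation stored under (category, index), or the fallback."""
    return by_path.get((category, index), fallback)
```

With the msgid-keyed catalog only one "May" entry survives, and rewording
the English message (say, adding a trailing period) makes the lookup fall
back to the untranslated text. The path-keyed catalog keeps both entries
and is unaffected by changes to the English wording.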

@A.Wilcox, in case you missed it, the decision to go for this kind of
representation was discussed in
https://www.openwall.com/lists/musl/2025/06/02/2, point 1. Sorry that
it ended up being a rather long email.

Best,
Pablo


> 
> I'll go over the proposed new binary format more when I finish writing
> it up, but on top of avoiding all these issues, it lets us get rid of
> all the repetitive linear-search-multistring operations in musl and
> replace them with efficient O(1) lookup regardless of whether a locale
> file or internal messages in libc are being used.
> 
> Rich
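
As a rough picture of the difference Rich describes, the Python sketch
below compares walking a NUL-separated multistring with jumping straight to
an entry through a precomputed offset table. Everything here is an
assumption for illustration (the message blob, the helper names, the table
layout); it is neither musl's current code nor the proposed binary format.

```python
# Hypothetical sketch; not musl's code and not the proposed locale format.

MESSAGES = b"No error information\0Permission denied\0No such file or directory\0"

def nth_linear(blob, n):
    """Find the n-th string by skipping over the n strings before it."""
    pos = 0
    for _ in range(n):
        pos = blob.index(b"\0", pos) + 1
    end = blob.index(b"\0", pos)
    return blob[pos:end].decode()

def build_offsets(blob):
    """Record where each string starts (done once, e.g. at build time)."""
    offsets = [0]
    for i, byte in enumerate(blob):
        if byte == 0 and i + 1 < len(blob):
            offsets.append(i + 1)
    return offsets

OFFSETS = build_offsets(MESSAGES)

def nth_indexed(blob, offsets, n):
    """Locate the n-th string with a single table lookup."""
    start = offsets[n]
    end = blob.index(b"\0", start)
    return blob[start:end].decode()
```

Both functions return the same text; the linear version does work
proportional to the index requested, while the indexed version finds the
start of the string with one table access.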
