Message-ID: <66590520b9fef551b6fa0f3b0b6beed579b413e4.camel@postmarketos.org>
Date: Fri, 19 Sep 2025 16:06:12 +0200
From: Pablo Correa Gomez <pabloyoyoista@...tmarketos.org>
To: Rich Felker <dalias@...c.org>, "A. Wilcox" <AWilcox@...cox-Tech.com>
Cc: musl@...ts.openwall.com
Subject: Re: Selecting locale source format

On Tue, 16-09-2025 at 21:36 -0400, Rich Felker wrote:
> On Tue, Sep 16, 2025 at 08:23:09PM -0500, A. Wilcox wrote:
> > On Sep 16, 2025, at 20:14, Rich Felker <dalias@...c.org> wrote:
> > >
> > > I have a proposed binary format for new locale files that I'm in
> > > the process of writing up, but Pablo brought it to my attention
> > > that, while binary format (ABI) is what's important to have down
> > > and stable at the time we integrate into musl, pinning down the
> > > source format is what's important/blocking for collaboration with
> > > localization folks.
> > >
> > > I have two candidate formats in the works right now for this:
> > >
> > >
> > > Option 1: subset+extension of POSIX localedef format.
> > >
> > > The basis for this format is described in
> > > https://pubs.opengroup.org/onlinepubs/9799919799/basedefs/V1_chap07.html
> > >
> > > If we go this way, it would be a "subset" because (1) some parts
> > > are not relevant, like LC_CTYPE, which does not vary by locale,
> > > (2) some parts will necessarily be represented in different ways,
> > > like collation where we're using UCA rather than the POSIX form,
> > > and (3) the format just has a lot of gratuitous cruft like
> > > symbolic character names. It will also necessarily be extended
> > > because POSIX localedef has no way to represent translated error
> > > strings etc. - keys for them have to be added.
> > >
> > > Going this route would have the source data in a fairly compact
> > > and "well-known" (to certain audiences) form, but requires that
> > > the tooling to produce binary locale files be aware of how these
> > > fields translate to the data model for the binary form.
> > >
> > > A sample (should be roughly correct C/POSIX locale) is attached
> > > for reference.
> > >
> > >
> > > Option 2: human-readable/text representation of the binary form
> > >
> > > Describing this requires a basic intro to the binary form, which
> > > is a multi-level hierarchical table mapping a path of integer key
> > > values to a data blob. In text we can represent keys with
> > > symbolic constants, but they're just a way of writing the
> > > underlying numbers. For example the path strerror/0 leads to the
> > > "No error information" text, strerror/EACCES leads to the
> > > "Permission denied" text, etc. Here "strerror" just represents a
> > > number for the first-level path component where strerror strings
> > > are stored, subindexed by (the arch/generic versions of) the
> > > errno codes.
> > >
> > > Going this route mostly avoids the need for smarts in the
> > > tooling, and "has more flexibility" to encode things. But this
> > > also potentially makes the encoding seem more arbitrary to
> > > localization folks.
> > >
> > > Like in option 1, a sample (some hybrid between C/POSIX and a
> > > hypothetical US-English locale, whipped up quick by hand as an
> > > example) of one way this format could look is attached for
> > > reference. An obvious variant that might be
> > > friendlier/more-familiar to folks working with the data would be
> > > representing the same in json (which is easy).
> > >
> > >
> > > My leaning is towards option 1.
> > >
> > > <sample_posix_localedef.txt><sample_binary_as_text.txt>
> >
> > Hi Rich,
> >
> > Thanks for continuing the locale work - very happy to see it
> > progressing!
> >
> > I definitely prefer option 1 as well. This will allow an easy
> > migration path for people using other Unix or Unix-like systems
> > (Solaris, AIX, glibc Linux) where localedef is also used. It also
> > means there is a large corpus of existing files we can use, both
> > for testing the tooling and for initial drafts at porting musl to
> > other locales.
> >
> > I think it is reasonable to extend the file to handle translations
> > for days of the week/months. Is there a reason the existing system
> > of gettext(3) can’t be used for strerror_l?
>
> The fundamental problem with the current system we have is gettext
> keying off of the English string. That was fatal for [AB]MON_5 "May",
> but it's also less than ideal for error messages. For example it's
> plausible we might use the same text for an errno code as for a regex
> or getaddrinfo error message, and then the keys would clash. And of
> course if the messages are changed at all, translation files get
> invalidated.

@A.Wilcox, in case you missed it, the decision to go for this kind of
representation was discussed in
https://www.openwall.com/lists/musl/2025/06/02/2, point 1. Sorry that
ended up being a bit of a long email.

Best,
Pablo

>
> I'll go over the proposed new binary format more when I finish writing
> it up, but on top of avoiding all these issues, it lets us get rid of
> all the repetitive linear-search-multistring operations in musl and
> replace them with efficient O(1) lookup regardless of whether a locale
> file or internal messages in libc are being used.
>
> Rich
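
The key-clash problem with the English-keyed approach, and the way an
integer-path lookup sidesteps it, can be illustrated with a minimal
Python sketch. This is not code from the thread or from musl: the names
MON and ABMON, their numeric values, and the lookup() helper are
hypothetical, a flat dict stands in for the multi-level table, and the
Spanish strings are only illustrative.

```python
# Hypothetical numeric IDs; the real key numbering is not given in the thread.
MON, ABMON = 1, 2   # assumed first-level path components
MAY = 5             # fifth month, used as the second-level index

# gettext-style catalog: the English source string is the lookup key.
english_keyed = {}
english_keyed["May"] = "mayo"   # translation intended for MON_5
english_keyed["May"] = "may"    # translation intended for ABMON_5 replaces it

# Path-keyed table: every message has its own integer path, so two
# messages with identical English text never share a slot.
path_keyed = {
    (MON, MAY): "mayo",
    (ABMON, MAY): "may",
}


def lookup(table, *path):
    """Return the text stored at an integer path, or None if absent."""
    return table.get(tuple(path))


full_name = lookup(path_keyed, MON, MAY)
short_name = lookup(path_keyed, ABMON, MAY)
```

With the English-keyed catalog only one translation of "May" survives,
whereas the path-keyed table keeps the full and abbreviated month names
apart. The same reasoning would apply to an errno message and a
getaddrinfo message that happen to share their English wording.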