Message-Id: <7B7869C8-5B94-4157-96DE-2B09B12BC64A@Wilcox-Tech.com>
Date: Tue, 16 Sep 2025 20:23:09 -0500
From: "A. Wilcox" <AWilcox@...cox-Tech.com>
To: musl@...ts.openwall.com
Subject: Re: Selecting locale source format

On Sep 16, 2025, at 20:14, Rich Felker <dalias@...c.org> wrote:
>
> I have a proposed binary format for new locale files that I'm in the
> process of writing up, but Pablo brought it to my attention that,
> while binary format (ABI) is what's important to have down and stable
> at the time we integrate into musl, pinning down the source format is
> what's important/blocking for collaboration with localization folks.
>
> I have two candidate formats in the works right now for this:
>
>
>
> Option 1: subset+extension of POSIX localedef format.
>
> The basis for this format is described in
> https://pubs.opengroup.org/onlinepubs/9799919799/basedefs/V1_chap07.html
>
> If we go this way, it would be a "subset" because (1) some parts are
> not relevant, like LC_CTYPE, which does not vary by locale, (2) some
> parts will necessarily be represented in different ways, like
> collation where we're using UCA rather than the POSIX form, and (3)
> the format just has a lot of gratuitous cruft like symbolic character
> names. It will also necessarily be extended because POSIX localedef
> has no way to represent translated error strings etc. - keys for them
> have to be added.
>
> Going this route would have the source data in a fairly compact and
> "well-known" (to certain audiences) form, but requires that the
> tooling to produce binary locale files be aware of how these fields
> translate to the data model for the binary form.
>
> A sample (should be roughly correct C/POSIX locale) is attached for
> reference.
>
>
>
>
> Option 2: human-readable/text representation of the binary form
>
> Describing this requires a basic intro to the binary form, which is a
> multi-level hierarchical table mapping a path of integer key values to
> a data blob. In text we can represent keys with symbolic constants,
> but they're just a way of writing the underlying numbers. For example
> the path strerror/0 leads to the "No error information" text,
> strerror/EACCES leads to the "Permission denied" text, etc. Here
> "strerror" just represents a number for the first-level path component
> where strerror strings are stored, subindexed by (the arch/generic
> versions of) the errno codes.
>
> Going this route mostly avoids the need for smarts in the tooling, and
> "has more flexibility" to encode things. But this also potentially
> makes the encoding seem more arbitrary to localization folks.
>
> Like in option 1, a sample (some hybrid between C/POSIX and a
> hypothetical US-English locale, whipped up quick by hand as an
> example) of one way this format could look is attached for reference.
> An obvious variant that might be friendlier/more-familiar to folks
> working with the data would be representing the same in json (which is
> easy).
>
>
>
>
> My leaning is towards option 1.
>
> <sample_posix_localedef.txt><sample_binary_as_text.txt>

Hi Rich,

Thanks for continuing the locale work - very happy to see it progressing!

I definitely prefer option 1 as well. This will allow an easy migration path for people using other Unix or Unix-like systems (Solaris, AIX, glibc Linux) where localedef is also used. It also means there is a large corpus of existing files we can use, both for testing the tooling and for initial drafts at porting musl to other locales.

I think it is reasonable to extend the file to handle translations for days of the week/months.

Is there a reason the existing system of gettext(3) can’t be used for strerror_l?

Best,
-Anna
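
For reference, the path-to-blob lookup described under option 2 can be sketched in C roughly as follows. This is only an illustration of the idea, not the proposed binary format: the struct layout, the `KEY_STRERROR` value, the `BLOB` macro, and the `lookup` function are all hypothetical names made up for this sketch, and a real implementation would read the table from a mapped locale file rather than from static arrays.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical first-level key; the thread does not give real numbers. */
enum { KEY_STRERROR = 1 };

/* Expands to a string pointer and its length without the trailing NUL. */
#define BLOB(s) s, sizeof(s) - 1

/* Second level: an integer key (here an errno code) mapped to a blob. */
struct leaf {
    int key;
    const char *blob;
    size_t len;
};

/* First level: an integer key mapped to a table of leaves. */
struct node {
    int key;
    const struct leaf *leaves;
    size_t nleaves;
};

/* The two example paths from the message: strerror/0 and strerror/EACCES. */
static const struct leaf strerror_leaves[] = {
    { 0,      BLOB("No error information") },
    { EACCES, BLOB("Permission denied") },
};

static const struct node root[] = {
    { KEY_STRERROR, strerror_leaves,
      sizeof strerror_leaves / sizeof strerror_leaves[0] },
};

/* Walk the two-component path k1/k2; return NULL if either key is absent. */
static const struct leaf *lookup(const struct node *tab, size_t n,
                                 int k1, int k2)
{
    for (size_t i = 0; i < n; i++) {
        if (tab[i].key != k1)
            continue;
        for (size_t j = 0; j < tab[i].nleaves; j++)
            if (tab[i].leaves[j].key == k2)
                return &tab[i].leaves[j];
    }
    return NULL;
}
```

With this table, `lookup(root, 1, KEY_STRERROR, EACCES)` returns the entry holding "Permission denied", and a path that is not present (for example `strerror/ENOENT` here) returns `NULL`, which a caller could treat as "fall back to the C locale text". A linear scan is used only to keep the sketch short.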