Message-ID: <20250917203136.GW1827@brightrain.aerifal.cx>
Date: Wed, 17 Sep 2025 16:31:36 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Re: Selecting locale source format
On Tue, Sep 16, 2025 at 09:14:07PM -0400, Rich Felker wrote:
> I have a proposed binary format for new locale files that I'm in the
> process of writing up, but Pablo brought it to my attention that,
> while binary format (ABI) is what's important to have down and stable
> at the time we integrate into musl, pinning down the source format is
> what's important/blocking for collaboration with localization folks.
>
> I have two candidate formats in the works right now for this:
>
>
>
> Option 1: subset+extension of POSIX localedef format.
>
> The basis for this format is described in
> https://pubs.opengroup.org/onlinepubs/9799919799/basedefs/V1_chap07.html
>
> If we go this way, it would be a "subset" because (1) some parts are
> not relevant, like LC_CTYPE, which does not vary by locale, (2) some
> parts will necessarily be represented in different ways, like
> collation where we're using UCA rather than the POSIX form, and (3)
> the format just has a lot of gratuitous cruft like symbolic character
> names. It will also necessarily be extended because POSIX localedef
> has no way to represent translated error strings etc. - keys for them
> have to be added.
>
> Going this route would have the source data in a fairly compact and
> "well-known" (to certain audiences) form, but requires that the
> tooling to produce binary locale files be aware of how these fields
> translate to the data model for the binary form.
>
> A sample (should be roughly correct C/POSIX locale) is attached for
> reference.

Since option 1 is the one I and others have preferred so far, I've
been putting together a short program that generates a file in this
format from the active host locale. This seems useful both as a source
for the template and as a means of verifying that all of the existing
information is represented/representable.
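
As a rough illustration of the approach (not the attached dumplocale.c,
whose contents are not reproduced here), a dump of a small slice of
LC_TIME from the active locale could look like the sketch below. It
assumes the POSIX localedef keywords "abday" and "d_fmt" are kept as-is
in the new source format; the helper names put_quoted, dump_key and
dump_lc_time are made up for this example.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>

/* Print s as a double-quoted localedef string, backslash-escaping
 * '"' and '\' (backslash is the default localedef escape character). */
static void put_quoted(FILE *f, const char *s)
{
	fputc('"', f);
	for (; *s; s++) {
		if (strchr("\"\\", *s))
			fputc('\\', f);
		fputc(*s, f);
	}
	fputc('"', f);
}

/* Write one keyword line: the keyword, then its values separated by ';'. */
static void dump_key(FILE *f, const char *key, const nl_item *items, size_t n)
{
	assert(f && key && items);
	fprintf(f, "%s ", key);
	for (size_t i = 0; i < n; i++) {
		if (i)
			fputc(';', f);
		put_quoted(f, nl_langinfo(items[i]));
	}
	fputc('\n', f);
}

/* Dump a small slice of LC_TIME for the currently active locale. */
static void dump_lc_time(FILE *f)
{
	static const nl_item abday[] = {
		ABDAY_1, ABDAY_2, ABDAY_3, ABDAY_4, ABDAY_5, ABDAY_6, ABDAY_7
	};
	static const nl_item d_fmt[] = { D_FMT };

	fputs("LC_TIME\n", f);
	dump_key(f, "abday", abday, sizeof abday / sizeof *abday);
	dump_key(f, "d_fmt", d_fmt, 1);
	fputs("END LC_TIME\n", f);
}
```

Calling setlocale(LC_ALL, "") and then dump_lc_time(stdout) from main
would dump the host's configured locale; without the setlocale call the
program stays in the "C" locale.
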

The attached version should dump all data from musl that ought to be
localizable, except signal descriptions (strsignal). These require some
consideration since the set of signals that need naming is very
slightly arch-specific (there is a largely unused "SIGEMT" on mips* in
place of the also largely unused "SIGSTKFLT" on other archs), and
there is fundamentally no way to extract the string for the one that's
not present on the host arch.
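
To make the arch-specific problem concrete, here is a minimal sketch of
what dumping signal descriptions from the host could look like. The
table is deliberately truncated, and the output keys (the signal macro
names) and the helper dump_signals are assumptions made for this
example, not a proposed key naming. Whichever of SIGEMT/SIGSTKFLT the
host headers do not define can only be emitted as a comment, since
there is no signal number to pass to strsignal.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>

struct sig_entry {
	const char *name;
	int num;        /* -1 if the host headers do not define this signal */
};

static const struct sig_entry sigs[] = {
	{ "SIGHUP", SIGHUP },
	{ "SIGINT", SIGINT },
#ifdef SIGSTKFLT
	{ "SIGSTKFLT", SIGSTKFLT },
#else
	{ "SIGSTKFLT", -1 },
#endif
#ifdef SIGEMT
	{ "SIGEMT", SIGEMT },
#else
	{ "SIGEMT", -1 },
#endif
};

/* Write one line per signal; returns how many could not be extracted. */
static int dump_signals(FILE *f)
{
	int missing = 0;

	assert(f != NULL);
	for (size_t i = 0; i < sizeof sigs / sizeof *sigs; i++) {
		if (sigs[i].num < 0) {
			/* '#' is the default localedef comment character */
			fprintf(f, "# %s: not defined on this arch\n", sigs[i].name);
			missing++;
		} else {
			fprintf(f, "%s \"%s\"\n", sigs[i].name,
			        strsignal(sigs[i].num));
		}
	}
	return missing;
}
```

On most Linux archs this would report SIGEMT as missing, and on mips*
it would report SIGSTKFLT instead, so no single host can produce a
complete template.
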

Another slight omission that needs consideration is having keys for
the "unknown error" cases. For strerror we just treat unknowns the
same as 0 ("No error information", not "Success"), but for regerror,
REG_OK is treated distinctly from invalid error codes. gai_strerror
and hstrerror are like this too, but by choice; we could assign "0" as
"unknown" easily for them. Signals already use 0 as "unknown".
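
The "0 doubles as unknown" convention can be sketched as a lookup that
folds every out-of-range code onto entry 0. The table demo_msgs and
the function lookup_msg are hypothetical names for this example only;
this is not how musl's strerror is actually implemented.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical message table: index 0 is also the "unknown" entry. */
static const char *const demo_msgs[] = {
	"No error information",        /* 0, and any unknown code */
	"Operation not permitted",     /* 1 */
	"No such file or directory",   /* 2 */
};

/* Return the message for code, falling back to entry 0 when the
 * code is negative or beyond the end of the table. */
static const char *lookup_msg(const char *const *tab, size_t n, int code)
{
	assert(tab != NULL && n > 0);
	if (code < 0 || (size_t)code >= n)
		code = 0;
	return tab[code];
}
```

With this shape, a source format only needs a key for code 0. For
regerror, where REG_OK is treated distinctly from invalid codes, one
additional key for the invalid-code message would still be required.
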

Running the program also exposes some errors in musl's built-in
C/C.UTF-8 locale, such as the LC_TIME era and alternate-digits items
containing copies of the non-era/normal-digits data rather than the
empty string "" that indicates "not available". I don't know why this
was done; aside from ALT_DIGITS it doesn't even work for the old
gettext-type locale support because duplicating the non-era strings as
keys inherently gives duplicate keys.
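
A check for this kind of duplication can be sketched as follows. The
function name count_era_copies is made up for this example, and it
covers only the three era format items that have a direct non-era
counterpart. The first string is copied before the second lookup
because POSIX allows a later nl_langinfo call to overwrite the
previously returned string.

```c
#include <assert.h>
#include <langinfo.h>
#include <stdio.h>
#include <string.h>

struct era_pair {
	const char *name;
	nl_item era, base;
};

/* Report era items that are non-empty copies of their non-era
 * counterparts in the active locale; returns how many were found. */
static int count_era_copies(FILE *f)
{
	static const struct era_pair pairs[] = {
		{ "era_d_fmt",   ERA_D_FMT,   D_FMT   },
		{ "era_t_fmt",   ERA_T_FMT,   T_FMT   },
		{ "era_d_t_fmt", ERA_D_T_FMT, D_T_FMT },
	};
	int n = 0;

	assert(f != NULL);
	for (size_t i = 0; i < sizeof pairs / sizeof *pairs; i++) {
		char era[256];

		/* Copy first: the next nl_langinfo call may reuse the buffer. */
		snprintf(era, sizeof era, "%s", nl_langinfo(pairs[i].era));
		if (era[0] && !strcmp(era, nl_langinfo(pairs[i].base))) {
			fprintf(f, "%s duplicates its non-era value \"%s\"\n",
			        pairs[i].name, era);
			n++;
		}
	}
	return n;
}
```

In a locale where the era items are correctly "" (not available), the
function returns 0 and prints nothing.
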

The current draft of the generation program is attached.

Rich

View attachment "dumplocale.c" of type "text/plain" (6764 bytes)