Message-ID: <02812813e9f5be4299a0e38c33a04d6a2e08c6f7.camel@postmarketos.org>
Date: Mon, 02 Mar 2026 14:54:36 +0100
From: Pablo Correa Gomez <pabloyoyoista@...tmarketos.org>
To: Rich Felker <dalias@...c.org>, musl@...ts.openwall.com
Subject: Re: Selecting locale source format
On Wed, 17-09-2025 at 16:31 -0400, Rich Felker wrote:
> On Tue, Sep 16, 2025 at 09:14:07PM -0400, Rich Felker wrote:
> > I have a proposed binary format for new locale files that I'm in the
> > process of writing up, but Pablo brought it to my attention that,
> > while binary format (ABI) is what's important to have down and stable
> > at the time we integrate into musl, pinning down the source format is
> > what's important/blocking for collaboration with localization folks.
> >
> > I have two candidate formats in the works right now for this:
> >
> >
> >
> > Option 1: subset+extension of POSIX localedef format.
> >
> > The basis for this format is described in
> > https://pubs.opengroup.org/onlinepubs/9799919799/basedefs/V1_chap07.html
> >
> > If we go this way, it would be a "subset" because (1) some parts are
> > not relevant, like LC_CTYPE, which does not vary by locale, (2) some
> > parts will necessarily be represented in different ways, like
> > collation where we're using UCA rather than the POSIX form, and (3)
> > the format just has a lot of gratuitous cruft like symbolic character
> > names. It will also necessarily be extended because POSIX localedef
> > has no way to represent translated error strings etc. - keys for them
> > have to be added.
> >
> > Going this route would have the source data in a fairly compact and
> > "well-known" (to certain audiences) form, but requires that the
> > tooling to produce binary locale files be aware of how these fields
> > translate to the data model for the binary form.
> >
> > A sample (should be roughly correct C/POSIX locale) is attached for
> > reference.
>
> Based on my and others' preference so far being this option 1, I've
> been putting together a short program to programmatically generate a
> file in this format from the active host locale. This seems useful
> both as a source of the template and as a means to verify that all of
> the existing information is represented/representable.
>
> The attached version should be dumping all should-be-localizable data
> from musl except signal descriptions (strsignal). These require some
> consideration since the set of signals that need naming is very
> slightly arch-specific (there is a largely unused "SIGEMT" on mips* in
> place of the also largely unused "SIGSTKFLT" on other archs), and
> there is fundamentally no way to extract the string for the one that's
> not present on the host arch.
>
> Another slight omission that needs consideration is having keys for
> the "unknown error" cases. For strerror we just treat unknowns the
> same as 0 ("No error information", not "Success"), but for regerror,
> REG_OK is treated distinctly from invalid error codes. gai_strerror
> and hstrerror are like this too, but by choice; we could assign "0" as
> "unknown" easily for them. Signals already use 0 as "unknown".
>
> Running the program also exposes some errors in musl's built-in
> C/C.UTF-8 locale, such as the LC_TIME era and alt digits stuff
> containing copies of the non-era/normal-digits data rather than ""
> indicating "not available". I don't know why this was done; aside from
> ALT_DIGITS it doesn't even work for the old gettext-type locale
> support because duplicating the non-era strings as keys inherently
> gives duplicate keys.
>
> Current draft of the generation program is attached.
I ran this program with LANG=es_ES.UTF-8 under both musl and Debian bookworm
glibc to compare the output. The May bug in musl's Spanish locale is clearly
visible. To compile the program under glibc I had to remove the REG_OK and
EAI_NODATA macros; otherwise it "just worked".
>
> Rich
View attachment "gcc.dump" of type "text/plain" (6648 bytes)
View attachment "musl.dump" of type "text/plain" (6273 bytes)