musl - Re: Selecting locale source format

Message-ID: <20260313025602.GN1827@brightrain.aerifal.cx>
Date: Thu, 12 Mar 2026 22:56:02 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Re: Selecting locale source format

On Wed, Sep 17, 2025 at 04:31:36PM -0400, Rich Felker wrote:
> On Tue, Sep 16, 2025 at 09:14:07PM -0400, Rich Felker wrote:
> > I have a proposed binary format for new locale files that I'm in the
> > process of writing up, but Pablo brought it to my attention that,
> > while binary format (ABI) is what's important to have down and stable
> > at the time we integrate into musl, pinning down the source format is
> > what's important/blocking for collaboration with localization folks.
> > 
> > I have two candidate formats in the works right now for this:
> > 
> > 
> > 
> > Option 1: subset+extension of POSIX localedef format.
> > 
> > The basis for this format is described in
> > https://pubs.opengroup.org/onlinepubs/9799919799/basedefs/V1_chap07.html
> > 
> > If we go this way, it would be a "subset" because (1) some parts are
> > not relevant, like LC_CTYPE, which does not vary by locale, (2) some
> > parts will necessarily be represented in different ways, like
> > collation where we're using UCA rather than the POSIX form, and (3)
> > the format just has a lot of gratuitous cruft like symbolic character
> > names. It will also necessarily be extended because POSIX localedef
> > has no way to represent translated error strings etc. - keys for them
> > have to be added.
> > 
> > Going this route would have the source data in a fairly compact and
> > "well-known" (to certain audiences) form, but requires that the
> > tooling to produce binary locale files be aware of how these fields
> > translate to the data model for the binary form.
> > 
> > A sample (should be roughly correct C/POSIX locale) is attached for
> > reference.
> 
> Based on my and others' preference so far being this option 1, I've
> been putting together a short program to programmatically generate a
> file in this format from the active host locale. This seems useful
> both as a source of the template and as a means to verify that all of
> the existing information is represented/representable.
> 
> The attached version should be dumping all should-be-localizable data
> from musl except signal descriptions (strsignal). These require some
> consideration since the set of signals that need naming is very
> slightly arch-specific (there is a largely unused "SIGEMT" on mips* in
> place of the also largely unused "SIGSTKFLT" on other archs), and
> there is fundamentally no way to extract the string for the one that's
> not present on the host arch.
> 
> Another slight omission that needs consideration is having keys for
> the "unknown error" cases. For strerror we just treat unknowns the
> same as 0 ("No error information", not "Success"), but for regerror,
> REG_OK is treated distinctly from invalid error codes. gai_strerror
> and hstrerror are like this too, but by choice; we could assign "0" as
> "unknown" easily for them. Signals already use 0 as "unknown".

Any opinions on key names for these in the source file? E0 and H0 are
already in the draft format for strerror and hstrerror, but aren't
really consistent and don't allow differentiating unknown (out of
bounds or unassigned error code) from 0 (errno wasn't set again after
the application set it to 0). While we don't distinguish those now, it
seems like we shouldn't lock ourselves out of distinguishing them.

I think I'd like to use E0 and H0 and EAI_0 for the case where the
error code is 0, matching REG_OK that's already there, and add new
keys for unknown/unmatched for all of them. Whatever name we choose
should be one that cannot clash with any future additions, so
something like EUNKNOWN would be a really bad choice. Off the top of
my head, I'd propose: E_, H_, EAI__, REG__, etc. So you'd have
something like:

E_ "Unknown error"
E0 "No error information"
EACCES "Permission denied"

REG__ "Unknown error"
REG_OK "No error"
REG_NOMATCH "No match"

etc.

And while this isn't a matter for the source format, I'd assign the
value -1 to these unknown keys, so that the key has a value that's
contiguous with the range of error codes but guaranteed not to match
one of them. (Note: EAI_* are negative to begin with, so we could
negate them all, or just accept the data table using negatives and put
EAI__ as +1. But this again is only a matter for the binary format not
source.)
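
To make the value assignment concrete, here is a minimal sketch in C of
how a lookup could behave with the unknown key at -1, and with the
EAI-style alternative of leaving codes negative and putting the unknown
key at +1. All names (demo_err, demo_gai, demo_strerror,
demo_gai_strerror) and the table layout are hypothetical, the code
numbers and strings are only illustrative, and none of this is the
actual binary format or musl's implementation.

```c
#include <stddef.h>

#define DEMO_NELEM(a) (sizeof (a) / sizeof (a)[0])

/* Hypothetical table indexed by code + 1, so the unknown key (-1)
 * lands in slot 0, contiguous with the real codes. */
static const char *const demo_err[] = {
    [-1 + 1] = "Unknown error",            /* E_ */
    [ 0 + 1] = "No error information",     /* E0 */
    [ 1 + 1] = "Operation not permitted",  /* illustrative code 1 */
    /* illustrative code 2 left unassigned (NULL) on purpose */
    [ 3 + 1] = "No such process",          /* illustrative code 3 */
};

/* Out-of-bounds and unassigned codes both map to the unknown entry;
 * code 0 keeps its own, distinct entry. */
const char *demo_strerror(int code)
{
    if (code < 0 || (size_t)code + 1 >= DEMO_NELEM(demo_err)
        || !demo_err[code + 1])
        return demo_err[0];
    return demo_err[code + 1];
}

/* EAI-style variant: codes stay negative, the unknown key takes +1,
 * and the table is indexed by 1 - code. */
static const char *const demo_gai[] = {
    "Unknown error",          /* EAI__ = +1 */
    "No error",               /* EAI_0 =  0 (illustrative string) */
    "Invalid flags",          /* illustrative code -1 */
    "Name does not resolve",  /* illustrative code -2 */
};

const char *demo_gai_strerror(int code)
{
    /* Lowest (most negative) code that has a slot in the table. */
    int min = -(int)(DEMO_NELEM(demo_gai) - 2);
    if (code > 0 || code < min)
        return demo_gai[0];
    return demo_gai[1 - code];
}
```

Either way the unknown entry sits at one end of the table, so the range
check and the fallback both resolve to slot 0.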

Rich