musl - Re: Planned locale work and community thoughts

Message-ID: <20250807192507.GO1827@brightrain.aerifal.cx>
Date: Thu, 7 Aug 2025 15:25:07 -0400
From: Rich Felker <dalias@...c.org>
To: Pablo Correa Gomez <pabloyoyoista@...tmarketos.org>
Cc: musl@...ts.openwall.com
Subject: Re: Planned locale work and community thoughts

On Wed, Jun 18, 2025 at 03:28:47PM -0400, Rich Felker wrote:
> On Mon, Jun 02, 2025 at 07:37:51PM +0200, Pablo Correa Gomez wrote:
> > Hi everybody,
> > 
> > I am Pablo Correa Gomez, a member of postmarketOS Core Contributors,
> > working on the collation and locale overhaul project
> > (https://www.openwall.com/lists/musl/2025/05/05/5) together with Rich.
> > 
> > We now have more details on the planned locale work that was
> > announced earlier. The current musl locale experience is sub-par
> > compared to other platforms, and we plan to use this project to fix
> > that.
> > 
> > The main and biggest issue that we aim to solve is the representation
> > format of the locale strings. The initial implementation used English
> > strings as keys to look up translations. This had a major issue
> > where May would represent both the abbreviated and non-abbreviated
> > forms of the month, making it untranslatable in languages where May has
> > more than 3 letters. However, there are other issues that we are
> > also aiming to solve in this project:
> 
> Main decision to be made here is how we key items that need
> localization, whether by fixing the string-based keying (e.g. using
> the macro names like "ABMON5" as the keys) with the gettext-type
> lookup we have now, or switching to assigned integer indices as the
> keying for a more catgets-like system (likely using the values from
> the macros in langinfo.h as the indices), or something else.

I've been reviewing the options here with the intent of making a
proposal, and my thinking so far is that neither of the above (catgets
approach or existing gettext approach) is very good.

While the integer-key approach of catgets solves the problem of
English strings being a really poor key for looking up the localized
value, the gratuitous vastness of the 32x32-bit keyspace necessitates
binary search, which is undesirably costly and pretty much entirely
defeats any runtime-efficiency argument for integer keying over
gettext-style string keying.
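
To make the cost concrete, here is a minimal sketch of what a
catgets-like lookup ends up looking like when keys are arbitrary
(set_id, msg_id) pairs. Everything here (struct cg_entry, cg_catalog,
cg_get, and the sample data) is hypothetical and only for
illustration; it is not an existing or proposed musl interface.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical catalog entry keyed by a sparse (set_id, msg_id) pair. */
struct cg_entry {
	uint32_t set_id, msg_id;
	const char *text;
};

/* Order entries by set_id first, then msg_id. */
static int cg_cmp(const void *a, const void *b)
{
	const struct cg_entry *x = a, *y = b;
	if (x->set_id != y->set_id) return x->set_id < y->set_id ? -1 : 1;
	if (x->msg_id != y->msg_id) return x->msg_id < y->msg_id ? -1 : 1;
	return 0;
}

/* Illustrative data, kept sorted by (set_id, msg_id). Nothing limits
 * how far apart the keys are, so they cannot be used as array indices. */
static const struct cg_entry cg_catalog[] = {
	{ 1, 7, "lun" },
	{ 1, 4000000000u, "mar" },
	{ 9, 2, "enero" },
	{ 70000, 1, "mayo" },
};

/* Every lookup pays for O(log n) comparisons. */
static const char *cg_get(uint32_t set_id, uint32_t msg_id, const char *fallback)
{
	struct cg_entry key = { set_id, msg_id, 0 };
	const struct cg_entry *e = bsearch(&key, cg_catalog,
		sizeof cg_catalog / sizeof *cg_catalog,
		sizeof *cg_catalog, cg_cmp);
	return e ? e->text : fallback;
}
```

With this layout, cg_get(9, 2, "January") walks the sorted array by
comparison, much as a string-keyed lookup would, just with cheaper
comparisons.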

What I'm leaning towards proposing is a direct integer-indexed
multi-level table, analogous in form to the tables collation weight
lookup will use.
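
As a rough sketch of the idea (not the actual proposed format, which
is still to be fleshed out), a two-level table indexed directly by a
small dense key could look like the following. The key layout
(category in the high 16 bits, item in the low 16 bits), the L_*
macros, struct l_category, l_lookup, and the sample Spanish data are
all assumptions made up for this example.

```c
#include <stddef.h>

/* Hypothetical key layout: category in the high 16 bits,
 * item index in the low 16 bits. */
#define L_KEY(cat, idx) (((long)(cat) << 16) | (idx))
#define L_ABMON_5 L_KEY(1, 4)   /* abbreviated name of the 5th month */
#define L_MON_5   L_KEY(2, 4)   /* full name of the 5th month */

/* Second level: a dense array of strings for one category. */
struct l_category {
	size_t count;
	const char *const *items;
};

/* Illustrative data only, truncated to five months. */
static const char *const es_abmon[] = { "ene", "feb", "mar", "abr", "may" };
static const char *const es_mon[] = { "enero", "febrero", "marzo", "abril", "mayo" };

/* First level: indexed by category number; category 0 is empty here. */
static const struct l_category es_table[] = {
	{ 0, 0 },
	{ 5, es_abmon },
	{ 5, es_mon },
};

/* Two bounds checks and two array indexing steps; no searching. */
static const char *l_lookup(const struct l_category *tab, size_t ncat,
	long key, const char *fallback)
{
	if (key < 0) return fallback;
	size_t cat = (size_t)key >> 16, idx = (size_t)key & 0xffff;
	if (cat >= ncat || idx >= tab[cat].count) return fallback;
	return tab[cat].items[idx];
}
```

Here l_lookup(es_table, 3, L_ABMON_5, "May") and
l_lookup(es_table, 3, L_MON_5, "May") resolve to different strings,
which is exactly what keying on the English text "May" could not
express.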

For reference, the currently needed lookups are:

1. nl_langinfo keys (this also covers all date/time functionality)
2. strerror (errno.h codes)
3. gai_strerror (netdb.h EAI_* codes)
4. hstrerror (legacy getXbyY resolver API error codes)
5. regerror (regex.h REG_* error codes)

And the added ones we will have are:

6. struct localeconv contents (for LC_NUMERIC/LC_MONETARY)
7. collation weight table roots (for LC_COLLATE)

Of these, items 1 and 3-5 already have arch-independent sequential
keys that are public constants (item 1 has them as 2-level, which is
fine). Item 2 (strerror) does require index remapping on archs with
their own numbering. Item 6 (localeconv) does not have lookups
addressable by an application, just stuffing data into a struct at
locale load-time, so there are no real constraints to set here. I can
just propose a simple format for the data to be loaded from. And item
7 is its own thing already covered by a multi-level table.

All of the indexing for 1-5 (for 2, via the arch/generic version of
errno.h) comes from constants that are already public ABI and thus
stable.
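
For item 2, the remapping could be as simple as a small per-arch array
that translates the arch's errno value into the generic index used as
the key in the locale data. The sketch below is only an illustration
under that assumption: the names arch_to_generic and errno_to_index
are hypothetical, the numbers are made up rather than taken from any
real arch's errno.h, and index 0 is assumed to be reserved for
"unknown error".

```c
#include <stddef.h>

/* Hypothetical remap table for an arch with its own errno numbering.
 * Position = arch errno value, content = generic index.
 * The values are illustrative only; unlisted slots default to 0,
 * which this sketch reserves for "unknown error". */
static const unsigned char arch_to_generic[] = {
	[1] = 1,
	[2] = 2,
	[35] = 42,
	[45] = 35,
};

/* Translate an arch errno value to the generic index, mapping
 * anything out of range to 0. */
static size_t errno_to_index(int e)
{
	if (e < 0 || (size_t)e >= sizeof arch_to_generic) return 0;
	return arch_to_generic[e];
}
```

On archs that use the generic numbering the table would be the
identity and could be skipped entirely, so the extra step only costs
anything where the numbering actually differs.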

Anything in 6-7 is up to us to define in a way that's stable and
future-proof/extensible.

I'll follow up with more of this fleshed out.