Message-ID: <20250917173745.GV1827@brightrain.aerifal.cx>
Date: Wed, 17 Sep 2025 13:37:45 -0400
From: Rich Felker <dalias@...c.org>
To: enh <enh@...gle.com>
Cc: musl@...ts.openwall.com
Subject: Re: Selecting locale source format

On Wed, Sep 17, 2025 at 11:43:46AM -0400, enh wrote:
> On Tue, Sep 16, 2025 at 9:14 PM Rich Felker <dalias@...c.org> wrote:
> >
> > I have a proposed binary format for new locale files that I'm in the
> > process of writing up, but Pablo brought it to my attention that,
> > while binary format (ABI) is what's important to have down and stable
> > at the time we integrate into musl, pinning down the source format is
> > what's important/blocking for collaboration with localization folks.
> >
> > I have two candidate formats in the works right now for this:
> >
> >
> >
> > Option 1: subset+extension of POSIX localedef format.
> >
> > The basis for this format is described in
> > https://pubs.opengroup.org/onlinepubs/9799919799/basedefs/V1_chap07.html
> >
> > If we go this way, it would be a "subset" because (1) some parts are
> > not relevant, like LC_CTYPE, which does not vary by locale,
> 
> note that that's not true

This was a statement about musl and musl's LC_CTYPE, not about what
you could theoretically do.

> for 'i' in turkish/azeri locales, for
> example. (unless you meant that you plan on using the unicode cldr
> data directly here.)
> 
> see the "Language-Sensitive Mappings" section of SpecialCasing.txt for
> all the special cases.

There is really no way to support this except in legacy 8-bit
encodings, which are out of scope for musl. This is because the
interface gives toupper() and tolower() no way to map to a
multibyte sequence. AFAICT tolower/toupper and towlower/towupper have
to be consistent with each other, but here they can't be.
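
A minimal sketch of the constraint, assuming a UTF-8 locale. The
helper names utf8_encode() and expressible_by_toupper() are
hypothetical and not part of musl or any standard interface; they
only show that the Turkish uppercase of 'i', U+0130, needs two bytes
while toupper() can return only one unsigned char value (or EOF):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one code point as UTF-8 and return the
 * number of bytes written. Surrogates are not rejected; this is only
 * an illustration. */
static size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(cp <= 0x10FFFF);
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}

/* toupper() yields a single unsigned char value (or EOF), so in UTF-8
 * only code points with a one-byte encoding can be a result. */
static int expressible_by_toupper(uint32_t cp)
{
    unsigned char buf[4];
    return utf8_encode(cp, buf) == 1;
}
```

Here expressible_by_toupper('I') is nonzero, while
expressible_by_toupper(0x0130) is zero, since U+0130 encodes as the
two bytes C4 B0. towupper() could return U+0130 as a single wchar_t,
which is where the two families would diverge.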

In any case, re-litigating this is outside the scope of the project at
hand.

Transforming the case of natural-language text involves all sorts of
complexity that none of the standard C interfaces can adequately
support; it requires a more expressive framework. The standard
interfaces are really not suitable for anything more than
case-insensitive comparisons (and they don't suffice even for that in
the case of ß vs SS) or other very basic uses.
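
A rough sketch of the ß vs SS problem. The function ci_equal_simple()
is a hypothetical name for the naive per-character comparison that the
standard interfaces allow; because towlower() maps one wide character
to exactly one wide character, it can never equate the one-character
ß with the two-character SS:

```c
#include <assert.h>
#include <stdbool.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: case-insensitive equality built only on the
 * one-to-one towlower() mapping. */
static bool ci_equal_simple(const wchar_t *a, const wchar_t *b)
{
    assert(a && b);
    while (*a && *b) {
        if (towlower((wint_t)*a) != towlower((wint_t)*b))
            return false;
        a++;
        b++;
    }
    /* Equal only if both strings ended at the same position. */
    return *a == *b;
}
```

With this, ci_equal_simple(L"Hello", L"hELLO") is true, but
ci_equal_simple(L"stra\u00dfe", L"STRASSE") is false: the strings have
different lengths, and no one-to-one mapping can bridge that.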

Rich
