Message-ID: <20250917173745.GV1827@brightrain.aerifal.cx>
Date: Wed, 17 Sep 2025 13:37:45 -0400
From: Rich Felker <dalias@...c.org>
To: enh <enh@...gle.com>
Cc: musl@...ts.openwall.com
Subject: Re: Selecting locale source format

On Wed, Sep 17, 2025 at 11:43:46AM -0400, enh wrote:
> On Tue, Sep 16, 2025 at 9:14 PM Rich Felker <dalias@...c.org> wrote:
> >
> > I have a proposed binary format for new locale files that I'm in the
> > process of writing up, but Pablo brought it to my attention that,
> > while binary format (ABI) is what's important to have down and stable
> > at the time we integrate into musl, pinning down the source format is
> > what's important/blocking for collaboration with localization folks.
> >
> > I have two candidate formats in the works right now for this:
> >
> >
> >
> > Option 1: subset+extension of POSIX localedef format.
> >
> > The basis for this format is described in
> > https://pubs.opengroup.org/onlinepubs/9799919799/basedefs/V1_chap07.html
> >
> > If we go this way, it would be a "subset" because (1) some parts are
> > not relevant, like LC_CTYPE, which does not vary by locale,
>
> note that that's not true

This was a statement about musl and musl's LC_CTYPE, not about what
you could theoretically do.

> for 'i' in turkish/azeri locales, for
> example. (unless you meant that you plan on using the unicode cldr
> data directly here.)
>
> see the "Language-Sensitive Mappings" section of SpecialCasing.txt for
> all the special cases.

There really is not a way to support this except in legacy 8bit
encodings, which are out-of-scope for musl. This is because the
interface doesn't have any way for toupper() or tolower() to map to a
multibyte sequence. AFAICT tolower/toupper and towlower/towupper have
to be consistent with each other, but can't be. In any case
re-litigating this is not in the scope of the project at hand.

There is all sorts of complexity to transforming case of
natural-language text that cannot adequately be supported by any of
the standard C interfaces but that requires a more expressive
framework. The standard interfaces are really not suitable for
anything more than case-insensitive comparisons (if even that; they
don't suffice even for that in the case of ß vs SS) or other very
basic uses.

Rich
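
To make the ß vs SS point concrete, here is a minimal sketch in C. It assumes a naive case-insensitive comparison built on towlower(), which maps one wide character to exactly one wide character. The helper name simple_casecmp is hypothetical and is not part of musl or of any standard library.

```c
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper (not a musl or ISO C function): compares two
 * wide strings after folding each character with towlower().
 * Returns 0 if the strings match under this per-character fold,
 * and a nonzero value otherwise. */
int simple_casecmp(const wchar_t *a, const wchar_t *b)
{
    while (*a && *b) {
        /* towlower() can only return a single wint_t, so a mapping
         * such as ß <-> SS (one character to two) cannot be
         * expressed at this level. */
        wint_t ca = towlower((wint_t)*a);
        wint_t cb = towlower((wint_t)*b);
        if (ca != cb)
            return ca < cb ? -1 : 1;
        a++;
        b++;
    }
    /* The strings are equal only if both ended at the same time. */
    if (*a)
        return 1;
    if (*b)
        return -1;
    return 0;
}
```

With this sketch, simple_casecmp(L"Musl", L"MUSL") returns 0, but simple_casecmp(L"stra\u00dfe", L"STRASSE") returns nonzero: the fold is one-to-one, so the six-character string can never line up with the seven-character one.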