Date: Thu, 10 Aug 2023 16:41:38 +0100
From: Alastair Houghton <ahoughton@...le.com>
To: musl@...ts.openwall.com
Cc: Rich Felker <dalias@...c.org>
Subject: setlocale() again

Hi again,

I spent some time today looking at the setlocale() problem and thought I’d put some notes down in an email.

1. Musl wishes to support UTF-8 “out of the box”.

2. At the same time, it needs to be 8-bit-safe, so the default locale, C, is NOT UTF-8.

3. POSIX and the C standard specify that setlocale() should fail if the locale name isn’t a valid locale, but don’t say precisely what “valid” means.  A program that wants UTF-8 support and that calls `setlocale(LC_ALL, "")` can therefore find itself in the C locale if the one specified in the environment happens to be invalid.

4. This seemed undesirable, so setlocale() presently accepts any locale name as valid; if it doesn’t have a definition file for a locale, it copies the C.UTF-8 locale, gives the copy the name passed in, and returns that name.  This avoids the problem in (3), and also means that gettext() will work for any language without installing locale data for Musl.  Unfortunately it also means that there is no way for a program (notably a test suite) to determine whether data for a locale is present, because setlocale() will always succeed, even if we don’t have the data.

5. Back in 2017 (https://www.openwall.com/lists/musl/2017/11/08/2) Rich proposed changing things so that `setlocale(cat, "")` always succeeds, treating an unknown locale in the environment as C.UTF-8, while `setlocale(cat, explicit_name)` fails unless a valid definition file is installed for that locale name.  This would also avoid the problem in (3), although it would mean that gettext() will not work unless a valid locale definition is installed for the C library.  (BTW, this is exactly the situation Glibc is in here; if Glibc doesn’t have locale data, it will fail setlocale() and then gettext() will find itself in the C locale.)  On the other hand, it does mean that programs can detect whether or not a given locale is present.
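
To make the difference between (4) and (5) concrete, here is a rough sketch of the kind of probe a test suite might use.  The helper name `locale_accepted` is made up for illustration and is not taken from libc++ or Musl; it only uses the standard setlocale() call.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of any library): returns 1 if
   setlocale(LC_ALL, name) succeeds, 0 if it returns NULL.
   The previous locale is restored before returning. */
int locale_accepted(const char *name)
{
    assert(name != NULL);

    /* Copy the current locale string; later setlocale() calls may
       overwrite the buffer it points to. */
    const char *cur = setlocale(LC_ALL, NULL);
    char *saved = NULL;
    if (cur) {
        saved = malloc(strlen(cur) + 1);
        if (saved)
            strcpy(saved, cur);
    }

    int ok = setlocale(LC_ALL, name) != NULL;

    if (saved) {
        setlocale(LC_ALL, saved);
        free(saved);
    }
    return ok;
}
```

With the behaviour in (4), `locale_accepted("fr_FR")` reports success on Musl whether or not French data is installed, so the probe tells you nothing.  With the behaviour in (5), it would report success only when a definition file for that name is present, which is what a test suite needs.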

Why do I care?  Because I’m trying to make libc++ work with Musl, and right now it has failing tests because it expects (not entirely unreasonably) that if e.g. `setlocale(LC_ALL, "fr_FR")` succeeds, then the C library will localise things into French.  While I can test for the unusual behaviour of Musl detailed in (4), the libc++ maintainer understandably doesn’t like that, and we would both far rather Musl were fixed to behave similarly to other implementations.

It seems to me that Rich’s proposal (5) was sensible.  Programs that use gettext(), and users relying on it for localization, must already cope with the fact that the C library must have locale data for their chosen locale in order for gettext() to work; that is how things work on Glibc.  It so happened that (4) meant that such programs would work with partial localization on Musl without there being any locale data installed for Musl, but that isn’t really right (e.g. you might get localized strings from gettext() mixed with numeric formatting that doesn’t match; for French, for instance, numbers would use “.” instead of “,” as the decimal separator).
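
The decimal-separator mismatch can be observed directly with localeconv().  The following is only a sketch; `decimal_point_for` is a hypothetical helper name of my own, built on the standard setlocale() and localeconv() calls.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: copies the decimal separator used by locale
   `name` into buf.  Returns 0 on success, -1 if setlocale() rejects
   the name or buf is too small.  LC_NUMERIC is restored afterwards. */
int decimal_point_for(const char *name, char *buf, size_t len)
{
    assert(name != NULL && buf != NULL && len > 0);

    /* Copy the current LC_NUMERIC setting so it can be restored. */
    const char *cur = setlocale(LC_NUMERIC, NULL);
    char *saved = NULL;
    if (cur) {
        saved = malloc(strlen(cur) + 1);
        if (saved)
            strcpy(saved, cur);
    }

    int rc = -1;
    if (setlocale(LC_NUMERIC, name) != NULL) {
        const char *dp = localeconv()->decimal_point;
        if (strlen(dp) < len) {
            strcpy(buf, dp);
            rc = 0;
        }
    }

    if (saved) {
        setlocale(LC_NUMERIC, saved);
        free(saved);
    }
    return rc;
}
```

A test that calls `decimal_point_for("fr_FR", ...)` would normally expect “,” when real French locale data is installed.  Under (4), the call succeeds on Musl without that data, and, as described above, the separator stays “.”.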

Looking at the 2017 thread, it appears it didn’t go anywhere for whatever reason, so I’d like to understand the status of the proposed change.  Was it nixed for some reason?  Is it likely to happen in the future?  If it’s a matter of resource, if I were to raise a patch for it, would it be accepted, in principle?

Kind regards,

Alastair.
