musl - setlocale behavior with 'missing' locales

Message-ID: <20171108050338.GL1627@brightrain.aerifal.cx>
Date: Wed, 8 Nov 2017 00:03:38 -0500
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: setlocale behavior with 'missing' locales

One of the primary concerns when the byte-based C locale was added(*)
was not to introduce regressions in the property that musl is "always
UTF-8" except when the user or application has explicitly requested a
byte-based ("C"/"POSIX") locale.

First, some background: In order for the standard libc interfaces to
honor character encoding, a portable program has always needed to call
setlocale(LC_CTYPE, "") or setlocale(LC_ALL, ""). Addition of the
byte-based C locale "disabled UTF-8" in any application which wasn't
calling setlocale, but that was deemed acceptable since such
applications were not portable and would not work on other systems
anyway.
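
As a minimal sketch of the portable pattern described above (the helper
names use_env_ctype and max_char_bytes are hypothetical, not part of any
libc), a program opts in to the environment's character encoding and can
then inspect MB_CUR_MAX, which is 1 in a byte-based "C" locale:

```c
#include <locale.h>
#include <stdlib.h>

/* Hypothetical helper: select the locale named by the environment
   (LC_ALL, LC_CTYPE, LANG) for character handling only.
   Returns the resulting locale name, or NULL if the request failed. */
const char *use_env_ctype(void)
{
    return setlocale(LC_CTYPE, "");
}

/* Hypothetical helper: maximum number of bytes per multibyte character
   in the currently active LC_CTYPE locale. */
size_t max_char_bytes(void)
{
    return MB_CUR_MAX;
}
```

A program that never calls use_env_ctype() (or setlocale directly) stays
in the startup "C" locale, which is the non-portable case mentioned above.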

The other important case to consider was failure of setlocale. Prior
to the addition of the byte-based C locale, setlocale was essentially
a no-op, and from a practical standpoint it didn't matter if it
succeeded or failed because the preexisting "C" locale at program
entry already provided UTF-8. But afterwards, if setlocale failed for
some reason, applications that were trying to do the right thing would
suffer a regression.

We ruled out spurious failure for resource exhaustion reasons by
making a statically allocated C.UTF-8 locale object. But the other
possible source of failure would have been having LC_* variables in
the environment (perhaps as a result of ssh'ing from another system or
running a musl-linked binary on a glibc-based system) with no
corresponding locale files for musl. If we treated that as an error,
UTF-8 would have suddenly broken in all sorts of real-world
situations, and one of the core original design goals/values of musl
would have been broken.

The choice I made at the time to avoid this was to declare that all
locale names are valid locales, and if there's no actual file defining
the locale, it's simply a clone of C.UTF-8. So for example if you run
with LC_ALL=fr_FR but no fr_FR translation file, you get a locale
named fr_FR (that's what setlocale reports as the active locale) but
with no translated messages/dates/etc., just UTF-8 character encoding
(so you're still able to access all characters properly and use
localized or multilingual data).
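
The following sketch shows one way to observe this from a program. The
helper describe_locale is hypothetical. Under the behavior described
above, requesting "fr_FR" without a locale file would be expected to
succeed and report the requested name with a multibyte encoding; on
other libcs the same request may simply fail.

```c
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: request locale `name` for all categories and
   write a short description of the result into buf.
   Returns 1 if setlocale succeeded, 0 if it returned NULL.
   Note: on success this changes the process-wide locale. */
int describe_locale(const char *name, char *buf, size_t len)
{
    const char *reported = setlocale(LC_ALL, name);
    if (!reported) {
        snprintf(buf, len, "setlocale(LC_ALL, \"%s\") failed", name);
        return 0;
    }
    /* MB_CUR_MAX reflects the encoding of the locale just selected. */
    snprintf(buf, len, "name=%s mb_cur_max=%zu",
             reported, (size_t)MB_CUR_MAX);
    return 1;
}
```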

Unfortunately this turns out to have been something of a tradeoff,
since there's no way for applications (and, as it turns out,
especially tests/test suites) to query whether a particular locale is
"really" available. I've been asked to change the behavior to fail on
unknown locale names, but of course that's not a working option in
light of the above.
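
To illustrate the problem, here is a sketch of the kind of probe a test
suite might use (locale_available is a hypothetical name). It relies
only on the return value of setlocale, so when every locale name is
accepted it reports "available" for any name and cannot tell whether a
locale file really exists.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical probe: returns 1 if setlocale accepts `name`, 0 if it
   rejects it, and -1 if the current locale could not be saved.
   The previously active locale is restored before returning. */
int locale_available(const char *name)
{
    /* The string returned by setlocale may be overwritten by later
       calls, so copy it before changing the locale. */
    const char *cur = setlocale(LC_ALL, NULL);
    char *saved = NULL;
    if (cur) {
        saved = malloc(strlen(cur) + 1);
        if (!saved)
            return -1;
        strcpy(saved, cur);
    }

    int ok = setlocale(LC_ALL, name) != NULL;

    if (saved) {
        setlocale(LC_ALL, saved);
        free(saved);
    }
    return ok;
}
```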

I think there may be a solution that makes everyone happy, but I'm not
sure yet. I'm going to follow up with a description and analysis of
whether it's valid/conforming.

Rich






(*) References on byte-based C locale:

Subject: [musl] Possible bytelocale patch
Message-ID: <20140703071318.GA10117@...ghtrain.aerifal.cx>

Subject: [musl] Revisiting byte-based C locale
Message-ID: <20150522022203.GA26651@...ghtrain.aerifal.cx>

Subject: [musl] [PATCH] Byte-based C locale, draft 1
Message-ID: <20150606214007.GA17398@...ghtrain.aerifal.cx>

commit 1507ebf837334e9e07cfab1ca1c2e88449069a80
byte-based C locale, phase 1: multibyte character handling functions

commit 16f18d036d9a7bf590ee6eb86785c0a9658220b6
byte-based C locale, phase 2: stdio and iconv (multibyte callers)