Date: Thu, 1 Mar 2018 13:10:47 -0600
From: William Pitcock <nenolod@...eferenced.org>
To: musl@...ts.openwall.com
Subject: Re: setlocale behavior with 'missing' locales

Hi,

On Wed, Feb 28, 2018 at 7:13 PM, Rich Felker <dalias@...c.org> wrote:
> On Wed, Nov 08, 2017 at 12:27:15AM -0500, Rich Felker wrote:
>> On Wed, Nov 08, 2017 at 12:03:38AM -0500, Rich Felker wrote:
>> > Unfortunately this turns out to have been something of a tradeoff,
>> > since there's no way for applications (and, as it turns out,
>> > especially tests/test suites) to query whether a particular locale is
>> > "really" available. I've been asked to change the behavior to fail on
>> > unknown locale names, but of course that's not a working option in
>> > light of the above.
>> >
>> > I think there may be a solution that makes everyone happy, but I'm not
>> > sure yet. I'm going to follow up with a description and analysis of
>> > whether it's valid/conforming.
>>
>> So here's the possible solution. ISO C leaves the default locale when
>> setlocale(cat,"") is called implementation-defined. POSIX however
>> defines it in terms of the LANG and LC_* environment variables. See
>> the CX text in:
>>
>> http://pubs.opengroup.org/onlinepubs/9699919799/functions/setlocale.html
>>
>>   "Setting all of the categories of the global locale is similar to
>>   successively setting each individual category of the global locale,
>>   except that all error checking is done before any actions are
>>   performed. To set all the categories of the global locale,
>>   setlocale() can be invoked as:
>>
>>   setlocale(LC_ALL, "");
>>
>>   In this case, setlocale() shall first verify that the values of all
>>   the environment variables it needs according to the precedence rules
>>   (described in XBD Environment Variables) indicate supported locales.
>>   If the value of any of these environment variable searches yields a
>>   locale that is not supported (and non-null), setlocale() shall
>>   return a null pointer and the global locale shall not be changed. If
>>   all environment variables name supported locales, setlocale() shall
>>   proceed as if it had been called for each category, using the
>>   appropriate value from the associated environment variable or from
>>   the implementation-defined default if there is no such value."
>>
>> and the Environment Variables text in XBD 8.2:
>>
>> http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap08.html#tag_08_02
>>
>> The former seems to tie our hands: unless the locales determined by
>> the environment variables all exist, setlocale is required to fail and
>> leave us in the (unacceptable) "C" locale where UTF-8 doesn't work.
>> However the latter seems to offer us a way out. After describing how
>> the precedence of the variables works, how locale pathnames work if
>> localedef is supported (musl doesn't support it), and how
>> implementation-provided/defined locale names work, it specifies:
>>
>>   "If the locale value is not recognized by the implementation, the
>>   behavior is unspecified."
>>
>> My optimistic reading of this is that, in the event the locale name
>> provided does not correspond to something we recognize, we're free to
>> define how it's interpreted, and always interpret it as C.UTF-8.
>>
>> What this would achieve is the following:
>>
>> 1. setlocale(cat, explicit_locale_name) - succeeds if the locale
>>    actually has a definition file, fails and returns a null pointer
>>    otherwise.
>>
>> 2. setlocale(cat, "") - always succeeds, honoring the environment
>>    variable for the category if a locale definition file by that name
>>    exists, but otherwise (the unspecified behavior) treating it as if
>>    it were C.UTF-8.
>>
>> This way, applications that probe for specific locale names can do so
>> and determine if they exist, but applications that just want to use
>> the default locale the user configured will still avoid catastrophic
>> breakage (failure to support UTF-8) even if they encounter "bad" LC_*
>> variables.
>>
>> Does this approach sound acceptable? I'm fairly content with
>> interpreting it as conforming to the standard; I'm mainly concerned
>> about whether there might be unforeseen breakage.
>>
>> One notable issue is that, right now, we rely on being able to set
>> LC_MESSAGES to an arbitrary name even if there's no libc locale
>> definition for it; this is because gettext() relies on the name of the
>> current LC_MESSAGES locale to find (application-specific) translation
>> files that might exist even without a libc translation. I'm not sure
>> how we would best keep this working under changes similar to the
>> above.
>
> Any further thoughts on this? I'd like to begin addressing these
> issues in this release cycle.
>
> I think the above plan works (is conforming, doesn't break things)
> except for the LC_MESSAGES issue mentioned at the end. I still don't
> have any good ideas for dealing with that. Really, since gettext can
> be used with any category, not just LC_MESSAGES (although LC_MESSAGES
> is the normal choice), it applies to all categories. Maybe we could
> still use the ("nonexistent") requested locale name in this case, or
> some derivative of it that clarifies that it's synthesized...?

+1 to using this approach.
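
To make the two cases concrete, here is a rough sketch of how an application might use them. The helper names locale_is_available() and use_default_locale() are made up for illustration and are not part of musl or any libc. The probe is only meaningful on an implementation where setlocale() with an explicit name fails for missing locales, which is what the proposal would give us.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns 1 if setlocale() accepts `name` for
 * LC_ALL, 0 otherwise. The previous global locale is restored before
 * returning. */
int locale_is_available(const char *name)
{
    /* setlocale(LC_ALL, NULL) only queries the current locale. Copy the
     * result, since a later setlocale() call may overwrite the string. */
    const char *cur = setlocale(LC_ALL, NULL);
    char *saved = NULL;
    int ok;

    if (cur) {
        size_t len = strlen(cur) + 1;
        saved = malloc(len);
        if (saved)
            memcpy(saved, cur, len);
    }

    /* Case 1: explicit name, expected to fail if no definition exists. */
    ok = setlocale(LC_ALL, name) != NULL;

    if (saved) {
        setlocale(LC_ALL, saved);
        free(saved);
    }
    return ok;
}

/* Hypothetical helper: adopt the locale configured in the environment.
 * Case 2: under the proposal this call would always succeed on musl.
 * Other implementations may return NULL, so fall back to "C" there. */
const char *use_default_locale(void)
{
    const char *r = setlocale(LC_ALL, "");
    return r ? r : setlocale(LC_ALL, "C");
}
```

A test suite could call locale_is_available() with the locale name it needs and skip the test when the result is 0, while ordinary applications would keep calling setlocale(LC_ALL, "") as they do today.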

We could use a locale name such as "en_US@...tual.UTF-8".

glibc uses this style of locale name for locales such as UK English
with eurozone LC_MONETARY: en_UK@...o.UTF-8.
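
For reference, whatever synthesized name we choose would be derived from the name the environment requests for the category. A minimal sketch of that lookup, following the XBD 8.2 precedence (LC_ALL, then the category's own variable, then LANG, with empty values treated as unset), might look like the following. The function name requested_locale_name() is hypothetical, and this is not how musl implements it internally.

```c
#include <stdlib.h>

/* Hypothetical helper: returns the locale name the environment requests
 * for one category, given that category's variable name (for example
 * "LC_MESSAGES"). Returns NULL when nothing is set, in which case the
 * implementation-defined default applies. */
const char *requested_locale_name(const char *category_var)
{
    const char *vars[3];
    int i;

    vars[0] = "LC_ALL";       /* highest precedence */
    vars[1] = category_var;   /* e.g. "LC_MESSAGES" */
    vars[2] = "LANG";         /* lowest precedence */

    for (i = 0; i < 3; i++) {
        const char *v = getenv(vars[i]);
        if (v && *v)          /* set and non-empty */
            return v;
    }
    return NULL;
}
```

With LC_MESSAGES set to a name that has no libc definition, this still returns that name, which is the string gettext() would need in order to find application translation files.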

William
