musl - Re: Locale bikeshed time

Message-ID: <20140724160150.GA4038@brightrain.aerifal.cx>
Date: Thu, 24 Jul 2014 12:01:50 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Re: Locale bikeshed time

On Thu, Jul 24, 2014 at 05:35:26PM +0200, u-igbb@...ey.se wrote:
> On Wed, Jul 23, 2014 at 05:01:20PM -0400, Rich Felker wrote:
> > > This feels appropriate - if the definitions indeed fall into distinctive
> > > classes like "full" / "single-category" and also if the naming reflects
> > > the distinction
> > 
> > IMO language-based locales should be ll, lll, ll_TT, or lll_TT form
> > where ll or lll is lowercase ISO language code and TT is uppercase
> > territory code. Non-language-based locale files should avoid these
> > patterns.
> 
> Just for certainty:
> 
> I assume you mean "l" above being lower case and non-language-based
> definitions to begin/consist of uppercase letters? Totally avoiding two-
> and three-letter combinations would be hardly followed by less scrupulous
> parties :) but you certainly did not mean this.

I just meant that language-based locales should match the pattern:

^[[:lower:]]{2,3}(_[[:upper:]]{2})?([[:punct:]].*)?$

assuming I didn't make any stupid mistakes in writing that regex. And
non-language-based locales should not match this pattern.

BTW POSIX actually describes this pattern (or similar) for locale
names under the XSI option.
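
A minimal sketch of checking names against the pattern above, in Python.
Python's re module does not support POSIX bracket classes, so this swaps in
ASCII ranges and string.punctuation, which is an assumption that matches how
those classes behave in the C locale. The function name is hypothetical.

```python
import re
import string

# Assumption: ASCII stand-ins for [[:lower:]], [[:upper:]] and [[:punct:]]
# as they behave in the C locale.
_PUNCT = "[" + re.escape(string.punctuation) + "]"
LANGUAGE_LOCALE = re.compile(r"[a-z]{2,3}(_[A-Z]{2})?(" + _PUNCT + r".*)?")


def is_language_based(name):
    """Return True if name is ll, lll, ll_TT or lll_TT, optionally
    followed by a punctuation character and an arbitrary suffix."""
    # fullmatch() plays the role of the ^ and $ anchors in the original.
    return LANGUAGE_LOCALE.fullmatch(name) is not None


for name in ("eo", "ast", "en_US", "en_US.UTF-8", "sr_RS@latin",
             "C", "POSIX", "C.UTF-8"):
    print(name, is_language_based(name))
```

Note that "_" is itself a punctuation character, so a name such as en_us
also matches, through the final optional group rather than the territory
group.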

> Btw do we have to also use lll (the three-letter codes) or would be
> the two-letter ones sufficient?

I believe there are some languages for which there is no two-letter
code. (Note that even the whole 26x26 space is probably insufficient
to represent all of the world's languages, and for practical purposes,
the letters should have some correspondence with the name of the
language.)

> I understand that this is not an implementation question but rather a
> discipline/policy one but in the long run it helps enormously to have
> a clean deployment idea from the beginning.

Agreed.

> An example of a spectacular failure to do so were the xkb keyboard maps.
> [
>   Two incompatible representations were in use, for many years (!) One was
>   reasonable, structured by country i.e. reflecting different countries'
>   actual standards. The other one was broken by design, using "language"
>   as the main key without any actual definition of its semantics. This
>   led to many of the available definitions being hardly useful hacks
>   (and of course to a lot of confusion for everyone as this thing was
>   impossible to document). Remarkably even the maintainers of the maps
>   at x.org/freedesktop.org at the time did not realize the origin of the
>   problem. I happen to have been involved in clarifying the issue,
>   now the structure of xkb/symbols is reasonable.
> ]

This text is utterly backwards, and I've complained about the policy
before, but gotten nowhere with it. Yes, many languages have keyboard
variants connected to a particular geographic territory (this is
mainly true for European languages, not so much for the rest of the
world), but it does not make keyboard layout a property of country.
You also have:

- Users who speak and use languages that have no relation to the
  country where they're living.

- Languages which have no territory.

- Languages used in a territory where the country it belongs to is
  disputed.

- Etc.

All of these issues make country-based keyboard selection at best
inconvenient, and at worst culturally and politically offensive, to
users. And offending users is utterly bad policy.

The same issue exists in glibc -- for a long time, their policy
mandated that all locales have a territory associated with them, and
this (along with other stupid policy) was preventing the addition of
the Esperanto locale. See:

https://sourceware.org/bugzilla/show_bug.cgi?id=16190

I believe the policy has been fixed now, but the discussion happened
on a different bug tracker issue and/or mailing list thread, and I
don't have the link.

> I am afraid that not stating a clean usage model may harm musl deployments
> too (say by mixing two- and three-letter locale codes so that one can not
> sanely know which kind to use).

The reasonable approach to this is probably just using the
three-letter codes for languages that do not have a two-letter code.
In practice I haven't seen such translations/locales on other systems,
but we certainly don't want to preclude them.
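
The rule above can be sketched as a small Python helper. The table is a
hand-picked sample for illustration, not something musl ships, and the
function name is hypothetical.

```python
# Hypothetical sample: (two-letter ISO 639-1 code or None, three-letter code).
SAMPLE_LANGUAGES = {
    "English": ("en", "eng"),
    "Esperanto": ("eo", "epo"),
    "Asturian": (None, "ast"),   # no two-letter code assigned
    "Hawaiian": (None, "haw"),   # no two-letter code assigned
}


def locale_language_code(alpha2, alpha3):
    """Prefer the two-letter code; fall back to the three-letter one."""
    return alpha2 if alpha2 else alpha3


for language, (alpha2, alpha3) in SAMPLE_LANGUAGES.items():
    print(language, locale_language_code(alpha2, alpha3))
```

With this rule each language gets exactly one code, so there is never a
choice between a two-letter and a three-letter name for the same language.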

Rich