musl - Re: Locale bikeshed time

Message-ID: <20140726072502.GR16795@example.net>
Date: Sat, 26 Jul 2014 09:25:03 +0200
From: u-igbb@...ey.se
To: musl@...ts.openwall.com
Subject: Re: Locale bikeshed time

On Fri, Jul 25, 2014 at 06:32:39PM -0400, Rich Felker wrote:
> > Somewhat cleaner might be:   ("zxx" and "ZZ" below are literals)
> > 
> >    no localization                         C
> >    language[+territory]                    ll[l][_TT]
> >    purely territorial                      zxx_TT   ("no language" code)
> 
> While clean and well-defined, I wonder whether zxx_TT is
> counter-intuitive to most users...

Sure, this contradicts the all too convenient inclination to use short
names when possible. Nevertheless I would even argue against myself
(again :) and say that we'd better disallow short variants altogether
(neither TT nor ll).

> > I think that a language code alone should mean "no territory-specific
> > stuff included" and nothing else.
> 
> I think that's reasonable.

Given that we'd need both this extra rule and a hope that the future
user/maintainer keeps it in mind too

> > Then "ll" would be a synonym for "ll_ZZ" and hence "ll_ZZ" will not have
> > to exist at all.

it would in fact be more robust to do the contrary and simply always
assume the full ll[l]_TT syntax, with zxx and ZZ being already defined
by the corresponding standards to denote the needed special cases.

Then this would be fully standard-compliant and consistent.
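
To make the "always full syntax" rule concrete, here is a minimal sketch
of a check that accepts only "C" or the full ll[l]_TT form. The function
name is_full_locale_name is hypothetical (nothing like it exists in musl
as far as I know), and the sketch only checks the shape of the name, not
whether the codes are actually assigned by the ISO 639 / ISO 3166 lists.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, for illustration only.
 * Returns 1 if name is "C" or has the full form ll[l]_TT
 * (2 or 3 lowercase ASCII letters, '_', 2 uppercase ASCII letters),
 * and 0 otherwise. Short forms such as "sv" or "SE" are rejected. */
int is_full_locale_name(const char *name)
{
    assert(name != NULL);

    if (strcmp(name, "C") == 0)
        return 1;

    /* Language part: count leading lowercase ASCII letters. */
    size_t i = 0;
    while (name[i] >= 'a' && name[i] <= 'z')
        i++;
    if (i < 2 || i > 3)
        return 0;

    /* Mandatory separator. */
    if (name[i] != '_')
        return 0;

    /* Territory part: exactly two uppercase ASCII letters. */
    const char *t = name + i + 1;
    if (!(t[0] >= 'A' && t[0] <= 'Z'))
        return 0;
    if (!(t[1] >= 'A' && t[1] <= 'Z'))
        return 0;
    return t[2] == '\0';
}
```

With this rule "sv_SE", "sv_ZZ" and "zxx_SE" are accepted, while "sv" and
"SE" alone are rejected, so nobody has to remember what a short name is
supposed to mean.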

I understand this may feel a bit strange and "too long" even though
the extra characters are hardly a burden in practice.

Let me compare this to DNS search domains: short names seem
convenient, but they are neither reliable nor scalable. Short locale
names are likewise prone to be misunderstood, and there will be
contributions with different semantics and long bikeshed discussions on
different forums about which one is right :)

In other words, I feel that it is clearer to _not_ include
Sweden-specific bits in "sv_ZZ" (which indicates "not _any_ country"
and hence "not Sweden") than to not include them in "sv".

> > LANG=sv_SE                        (decimal comma, "kr")
> > LANG=sv     LC_MONETARY=zxx_SE    (decimal point from "C", iso4217 "SEK")
> 
> Changing the numeric radix point is explicitly not supported. :)
> LC_NUMERIC is just always C because, well, numbers are numbers, not
> something to vary by culture, and changing the radix point just breaks
> parsing and storing data for interchange. LC_MONETARY on the other

I am fully with you on the point of formatting numerical data for
interchange. The purpose of a locale, though, is the exact _opposite_: to
represent data in a format especially chosen for the specific occasion
and a specific user, _differently_ from what would be suitable for the
rest of the world. Isn't it?

So I would say it is indeed stupid to localize data meant for
interchange. Nevertheless it may still be meaningful to format numbers
for the user's taste when the data presentation is only meant for some
kind of a "local" context.

Related to the decimal point issue:

I think we (or at least I) would need a clarification about
the role of the "C" locale. It is meant to mean "no localization", which
does not say that it is expected to provide a representation usable globally
(I think it is, on the contrary, by its origin heavily English/US biased).

I assume that you are aiming to reduce this bias as much as possible so
that "C" could be neutral and suitable for as many users/uses as possible.
Unfortunately this raises more questions, like the following:

According to https://en.wikipedia.org/wiki/Decimal_mark

"
Countries where a dot "." is used to mark the radix point comprise
roughly 60% of the world's population.[citation needed]
"

where the "[citation needed]" tag indicates that this figure is unreliable.

Notably, according to the same article (and verifiably :) the living
auxiliary languages meant for international communication all made a
different choice (apparently for reasons based on some research):

"
The three most spoken international auxiliary languages, Ido, Esperanto,
and Interlingua all use the comma as the official radix point
"

Is there anything that mandates that the C locale use "." as the radix point?
Is there any evidence that "." is more widely used than ","?
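
For the first question, what a given libc actually does can at least be
checked by asking localeconv(). As far as I can tell, the ISO C
description of struct lconv lists "." as the decimal_point value in the
"C" locale, so I would expect the sketch below to return 1 on a
conforming implementation. The function name c_locale_radix_is_dot is
made up for this example.

```c
#include <assert.h>
#include <locale.h>
#include <string.h>

/* Hypothetical helper, for illustration only.
 * Selects the "C" locale for LC_NUMERIC (note: this changes the
 * process-wide locale) and returns 1 if its radix character is ".",
 * 0 otherwise or if the locale could not be selected. */
int c_locale_radix_is_dot(void)
{
    if (setlocale(LC_NUMERIC, "C") == NULL)
        return 0;

    const struct lconv *lc = localeconv();
    assert(lc->decimal_point != NULL);

    return strcmp(lc->decimal_point, ".") == 0;
}
```

This only tells us what the standard and the implementation say, of
course, not whether "." is the better choice for humans.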

Do not misunderstand my questions as a cultural bias. I am _much_
more used to the decimal dot than the comma, because of my involvement
with programming languages using ".". Nevertheless, a locale is not about
representing data for computers, but for humans, and I would love to
have the best possible internationally useful locale as the default.

Otherwise let us say that "C" locale is for interacting with programs,
not with humans, period (those wishing a human-friendly internationally
sound environment are to use e.g. LANG=eo_ZZ).
This is possibly the only reliable/efficient/robust approach?

Yet it would be a pity to not have a common representation for both
humans and computers, without a cultural bias.

Rune