Date: Wed, 23 Jul 2014 12:39:07 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Re: Locale bikeshed time

On Wed, Jul 23, 2014 at 11:50:31AM +0200, u-igbb@...ey.se wrote:
> > > I actually do mix categories from different locales.
> > > No problem as long as the files are small.
> > 
> > Note that if you're just mixing "ll_TT" and "C", there wouldn't be any
> > cost anyway since the C locale (and its aliases) are builtin and never
> > loaded from a file. Where I was thinking you might see duplication is
> 
> Sure. This covers certainly most of my preferences but I thought of
> LANG=l1_T1 and LC_SOMETHING=l2_T2 [and LC_SOMETHINGELSE=l3_T3].
> This would result in pulling in two or three locale data files but the
> overhead is presumably negligible.

It's two or three sets of syscalls -- open (one per path component
tried until it succeeds), fstat, mmap, close -- rather than one set.
And an extra vma (resulting from the mmap) for each file used. But the
choice isn't whether to have this overhead or not, unless you want to
consider the glibc locale-archive ugliness. The choice is just whether
to optimize the case where the categories are all the same (only
having one set of syscalls in that case) or mostly the same, or not to
optimize it and always have multiple sets of syscalls. I believe the
latter is strictly worse.
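
To make the per-file cost concrete, here is a minimal sketch of one such
"set of syscalls". It is not musl's actual loader: the helper name
map_file_sketch is hypothetical, and the search over path components
(the repeated open calls until one succeeds) is left out.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper, not musl's code: map one file read-only.
 * On success returns the mapping and stores its length in *size;
 * on any failure returns NULL. */
static const void *map_file_sketch(const char *path, size_t *size)
{
	struct stat st;
	const void *map = NULL;

	int fd = open(path, O_RDONLY);           /* syscall 1: open */
	if (fd < 0) return NULL;

	if (fstat(fd, &st) == 0 && st.st_size > 0) {  /* syscall 2: fstat */
		/* syscall 3: mmap; each successful call adds one vma */
		void *p = mmap(NULL, (size_t)st.st_size, PROT_READ,
		               MAP_SHARED, fd, 0);
		if (p != MAP_FAILED) {
			map = p;
			*size = (size_t)st.st_size;
		}
	}

	close(fd);                               /* syscall 4: close */
	return map;
}
```

With categories drawn from two or three different files, this sequence
would run once per distinct file, which is the overhead being weighed
above.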

> > for things like: LC_ALL=ll_TT@...ifier where modifier is really just
> > an alternate for one category (e.g. ISO date format for time, alt
> > collation order, etc.), but the file ends up storing duplicates of all
> > the data from other categories. However, I think the alternate
> > preferred usage here would be to provide a file for just the category
> > being overridden that does not contain the base data and require users
> > to set the individual categories, like what you're doing, e.g.
> 
> > LANG=ll_TT LC_TIME=ll_TT@...date
> 
> This means that most of the time there will be a single locale file to be
> opened, sometimes more, in extreme cases up to the number of categories,
> the files also being of different "completeness". This would certainly
> contribute to confusion for both the administrators and the users.

Hmm. I see how it would be confusing and maybe it's best to discourage
this use (incomplete .mo files). But it's purely a usage issue,
outside of musl's control, unless we wanted to impose a check that any
locale file have data for all the categories (and I think such a rule
would be bad since it precludes having locale files that are unrelated
to languages, e.g. a generic "UCA" collation locale with the default
UCA data).

> For the sake of uniformity I would possibly prefer to see only the
> "thinner" files defining exactly one category, instead of different
> files having different numbers of included categories.

Yes that sounds like a good policy. Really, policy matters like this
(i.e. ones that don't affect libc implementation) should be worked out
when it comes time to actually make some locales and find a maintainer
for a musl-locale repo/package.

On that topic, while this is a matter outside my control for
individual users, my preference would be that the official musl-locale
data attempt to avoid multiple variants/modifiers and legacy options
if possible. For example I would like to see the numeric date format
be ISO format in all locales, with traditional formats only where the
natural-language string representations for months/days are included
(and I say this as someone coming from one of the locales, i.e. US,
where the traditional numeric date format is non-ISO). In keeping with
the principle that musl is "modern" I'd like to prefer modern cultural
conventions to historical ones.

> But most of all I'd support your approach of including all information in
> each file. This is "least confusing" and quite efficient. The overhead
> is mostly static storage (not noticeable in our setup and probably not
> much anyway :) and the run time overhead affects just the minority of
> users who mix locales/categories. (Oh btw as a nice bonus this makes
> the file boundaries correspond to the data usage patterns).
> 
> To summarize my view,
> 
> - a file per locale, with all categories included  best
> - a file per category                              acceptable
> - files with differing data subsets                please don't

Yes I think this makes sense. My leaning would be to use complete
files for language-based locales, and file-per-category for individual
category locales that are not associated with any particular language
(and where, thereby, there's no assumption that they should provide
any behavior to other categories).
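
For reference, the mixed setups discussed in this thread (LANG plus one
or more LC_* overrides) rely on the usual POSIX environment precedence:
LC_ALL first, then the category's own variable, then LANG. A rough
sketch of that lookup follows. The function name
effective_locale_sketch is made up for illustration, and the final
fallback to "C" is an assumption, since the default when nothing is set
is implementation-defined.

```c
#include <stdlib.h>

/* Hypothetical helper: pick the locale name for one category from the
 * environment, checking LC_ALL, then the category variable (for
 * example "LC_TIME"), then LANG. */
static const char *effective_locale_sketch(const char *category_var)
{
	const char *vars[3] = { "LC_ALL", category_var, "LANG" };

	for (int i = 0; i < 3; i++) {
		const char *v = getenv(vars[i]);
		if (v && *v) return v;  /* unset or empty: keep looking */
	}
	return "C";  /* assumption: fall back to the builtin C locale */
}
```

So with LANG=l1_T1 and LC_TIME=l2_T2, the time category resolves to
l2_T2 and every other category to l1_T1, meaning two distinct names to
load.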

Rich
