musl - Re: gettext and locale names

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <CAA-4+je+sRPjxYqpVQM1SLeSDjoRAeAyFirjZKQ16VpXDGQ-kA@mail.gmail.com>
Date: Fri, 20 May 2016 23:55:45 +0900
From: Masanori Ogino <masanori.ogino@...il.com>
To: musl@...ts.openwall.com
Subject: Re: gettext and locale names

Hello. I'm sorry for the delay.

2016-05-12 8:26 GMT+09:00 Rich Felker <dalias@...c.org>:
> On Mon, May 09, 2016 at 09:46:50PM +0900, Masanori Ogino wrote:
>> 2016-05-05 6:39 GMT+09:00 Rich Felker <dalias@...c.org>:
>> > On Wed, May 04, 2016 at 10:05:28PM +0900, Masanori Ogino wrote:
>> >> Hello,
>> >>
>> >> When I played with gettext API, I found that musl searches .mo files
>> >> with a directory named as current *full* locale names, e.g.
>> >> en_US.UTF-8. However, we often use shortened names too. Here is a list
>> >> of those names from those of my machine in /usr/share/locale: de,
>> >> en_GB, ru_UA.koi8u, sr@...in, etc.
>> >>
>> >> Due to this mismatch, we can't get translations with musl's gettext
>> >> API for applications in wild. Thus, I'm considering to implement
>> >> locale searching with shortening. Does it make sense?
>> >
>> > Yes, I think this makes sense. Before spending time on the code though
>> > it makes sense to discuss the proposed logic here. What level would
>> > the search/shortening happen at? __get_locale in locale_map.c? In
>> > dcngettext.c?
>>
>> Sure. I doubt that shortening in __get_locale might be insufficient
>> since some code may want the full locale name even if there is no
>> locale data for it. I will dig into the code.
>>
>> Another problem is the preference of shortened locales. Obviously, the
>> full locale itself has the highest priority and language-only locales
>> (e.g. en, de, etc.) do the lowest one. However, which is the preferred
>> locale, en_GB@...o or en_GB.UTF-8, when the code receives
>> en_GB.UTF-8@...o?
>>
>> I am unsure whether someone actually uses such locale, but I think it
>> is necessary to discuss such corner cases.
>
> Conceptually there are two sets of names the locale names need to lead
> us to: libc locales in MUSL_LOCPATH, and gettext translation files in
> directories provided to bindtextdomain. If/when we add non-stub
> catgets support, the locale name is also relevant to NLSPATH
> processing where %L expands to the whole locale name, and %l, %t, and
> %c expand to the language, territory, and codeset parts of it,
> respectively.

I didn't aware of this. Thank you.

> From musl's standpoint all locales are UTF-8-encoded, so the codeset
> portion of the locale name is at best redundant. The official musl
> locale files, once we have such a thing, should not have ".UTF-8" in
> their names, but a spurious ".UTF-8" component in the locale name
> string should be accepted (and ignored) for compatibility with
> glibc-based systems where the specifier may be necessary for
> glibc-linked programs to distinguish from legacy versions of the
> locales.
>
> In principle we could implement this by stripping the ".UTF-8" at
> setlocale time (in __get_locale from locale_map.c) but I don't see a
> major advantage in doing that versus keeping the full string and just
> stripping it when constructing filenames to try opening. On the other
> hand there are advantages to keeping it: some users/distros may want
> to put a spurious ".UTF-8" in the locale name to trick broken programs
> that use strstr on the locale name, rather than nl_langinfo(CODESET),
> to determine that they're in a UTF-8 environment.

I agree with you.

> For gettext translations, I haven't seen ".UTF-8" used either. My
> $prefix/share/locale directories have under them directories only of
> the forms "ll", "ll_TT", and "ll_TT@mod". If I'm not mistaken, modern
> gettext-based programs always store UTF-8 in their message catalogs
> and legacy locales are expected to convert the contents when loading.

I also have never seen *.UTF-8 locale directories.

> Based on all this, the search order we perform should probably be
> something like this: First, take the input locale name and strip any
> codeset identifier. Then, iterate over 4 steps:
>
> 1. Try full name.
> 2. If a modifier (@mod) is present, try with modifier removed.
> 3. If a territory (_TT) is present, try with territory removed.
> 4. If both modifer and territory are present, try with both removed.
>
> At worst this yields 4 file-open attempts, and only in the case where
> a user has requested a ll_TT@mod type locale but either the @mod or
> _TT does not exist. For ll_TT type locales, it yields at most 2
> attempts. For ll type locales, or locale names that don't fit the
> standard pattern, there should be at most one attempt.

This algorithm looks good to me.

> From an implementation side, note that, presently, dcngettext uses the
> full pathname of the message catalog file as the key to look up the
> memory-mapped image. This lookup needs to happen without touching the
> filesystem, and the only reason a pathname is used is that it
> encompasses all the necessary key components (bound directory, locale
> name, category name, domain name). So if we go with my above proposal,
> the "pathname" used as a key should still contain the full locale
> name, not the particular fallback it resolved to, and thus might not
> actually be a valid pathname anymore. Because of this it's plausible
> that the same catalog file could end up getting mapped more than once
> (e.g. as both /usr/share/locale/en_US/LC_MESSAGES/foo and
> /usr/share/locale/en/LC_MESSAGES/foo) but this doesn't incur any major
> cost and I don't think it's worth trying to detect and avoid.
>
> Does this all make sense? Does it sound reasonable?

Yes, I think it makes sense. Thanks for the suggestion.

After checking gettext-tools to confirm that .mo files are encoded to
UTF-8, I will try to implement the algorithm.

-- 
Masanori Ogino
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.