Date: Fri, 20 May 2016 23:55:45 +0900
From: Masanori Ogino <masanori.ogino@...il.com>
To: musl@...ts.openwall.com
Subject: Re: gettext and locale names

Hello.

I'm sorry for the delay.

2016-05-12 8:26 GMT+09:00 Rich Felker <dalias@...c.org>:
> On Mon, May 09, 2016 at 09:46:50PM +0900, Masanori Ogino wrote:
>> 2016-05-05 6:39 GMT+09:00 Rich Felker <dalias@...c.org>:
>> > On Wed, May 04, 2016 at 10:05:28PM +0900, Masanori Ogino wrote:
>> >> Hello,
>> >>
>> >> When I played with the gettext API, I found that musl searches for
>> >> .mo files in a directory named after the current *full* locale
>> >> name, e.g. en_US.UTF-8. However, we often use shortened names too.
>> >> Here are some such names from /usr/share/locale on my machine: de,
>> >> en_GB, ru_UA.koi8u, sr@...in, etc.
>> >>
>> >> Due to this mismatch, we can't get translations with musl's gettext
>> >> API for applications in the wild. Thus, I'm considering implementing
>> >> locale searching with shortening. Does it make sense?
>> >
>> > Yes, I think this makes sense. Before spending time on the code though
>> > it makes sense to discuss the proposed logic here. What level would
>> > the search/shortening happen at? __get_locale in locale_map.c? In
>> > dcngettext.c?
>>
>> Sure. I suspect that shortening in __get_locale might not be
>> appropriate, since some code may want the full locale name even if
>> there is no locale data for it. I will dig into the code.
>>
>> Another problem is the preference among shortened locales. Obviously,
>> the full locale itself has the highest priority and language-only
>> locales (e.g. en, de, etc.) have the lowest. However, which is the
>> preferred locale, en_GB@...o or en_GB.UTF-8, when the code receives
>> en_GB.UTF-8@...o?
>>
>> I am unsure whether anyone actually uses such a locale, but I think it
>> is necessary to discuss such corner cases.
>
> Conceptually there are two sets of names the locale names need to lead
> us to: libc locales in MUSL_LOCPATH, and gettext translation files in
> directories provided to bindtextdomain. If/when we add non-stub
> catgets support, the locale name is also relevant to NLSPATH
> processing where %L expands to the whole locale name, and %l, %t, and
> %c expand to the language, territory, and codeset parts of it,
> respectively.

I wasn't aware of this. Thank you.

> From musl's standpoint all locales are UTF-8-encoded, so the codeset
> portion of the locale name is at best redundant. The official musl
> locale files, once we have such a thing, should not have ".UTF-8" in
> their names, but a spurious ".UTF-8" component in the locale name
> string should be accepted (and ignored) for compatibility with
> glibc-based systems where the specifier may be necessary for
> glibc-linked programs to distinguish from legacy versions of the
> locales.
>
> In principle we could implement this by stripping the ".UTF-8" at
> setlocale time (in __get_locale from locale_map.c) but I don't see a
> major advantage in doing that versus keeping the full string and just
> stripping it when constructing filenames to try opening. On the other
> hand there are advantages to keeping it: some users/distros may want
> to put a spurious ".UTF-8" in the locale name to trick broken programs
> that use strstr on the locale name, rather than nl_langinfo(CODESET),
> to determine that they're in a UTF-8 environment.

I agree with you.

> For gettext translations, I haven't seen ".UTF-8" used either. My
> $prefix/share/locale directories have under them directories only of
> the forms "ll", "ll_TT", and "ll_TT@mod". If I'm not mistaken, modern
> gettext-based programs always store UTF-8 in their message catalogs
> and legacy locales are expected to convert the contents when loading.

I have never seen *.UTF-8 locale directories either.
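To illustrate the strstr versus nl_langinfo(CODESET) point quoted above, here is a minimal sketch contrasting the two checks. The function names looks_utf8_by_name and is_utf8_by_codeset are hypothetical and chosen only for this example; nl_langinfo and CODESET are the standard POSIX interfaces from <langinfo.h>. The caller is assumed to have already called setlocale.

```c
#include <langinfo.h>
#include <locale.h>
#include <string.h>

/*
 * Fragile check (hypothetical helper): guesses the encoding from the
 * spelling of the locale name. It reports "not UTF-8" for a name such
 * as "en_US" even when the locale is in fact UTF-8-encoded.
 */
int looks_utf8_by_name(const char *locale_name)
{
    return locale_name && strstr(locale_name, "UTF-8") != NULL;
}

/*
 * More robust check (hypothetical helper): asks the C library which
 * codeset the current LC_CTYPE locale uses. Assumes the program has
 * already selected its locale, e.g. with setlocale(LC_ALL, "").
 */
int is_utf8_by_codeset(void)
{
    return strcmp(nl_langinfo(CODESET), "UTF-8") == 0;
}
```

The first function is the kind of test that a spurious ".UTF-8" in the locale name would satisfy; the second does not depend on how the name is spelled.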

> Based on all this, the search order we perform should probably be
> something like this: First, take the input locale name and strip any
> codeset identifier. Then, iterate over 4 steps:
>
> 1. Try full name.
> 2. If a modifier (@mod) is present, try with modifier removed.
> 3. If a territory (_TT) is present, try with territory removed.
> 4. If both modifier and territory are present, try with both removed.
>
> At worst this yields 4 file-open attempts, and only in the case where
> a user has requested a ll_TT@mod type locale but either the @mod or
> _TT does not exist. For ll_TT type locales, it yields at most 2
> attempts. For ll type locales, or locale names that don't fit the
> standard pattern, there should be at most one attempt.

This algorithm looks good to me.

> From an implementation side, note that, presently, dcngettext uses the
> full pathname of the message catalog file as the key to look up the
> memory-mapped image. This lookup needs to happen without touching the
> filesystem, and the only reason a pathname is used is that it
> encompasses all the necessary key components (bound directory, locale
> name, category name, domain name). So if we go with my above proposal,
> the "pathname" used as a key should still contain the full locale
> name, not the particular fallback it resolved to, and thus might not
> actually be a valid pathname anymore. Because of this it's plausible
> that the same catalog file could end up getting mapped more than once
> (e.g. as both /usr/share/locale/en_US/LC_MESSAGES/foo and
> /usr/share/locale/en/LC_MESSAGES/foo) but this doesn't incur any major
> cost and I don't think it's worth trying to detect and avoid.
>
> Does this all make sense? Does it sound reasonable?

Yes, I think it makes sense. Thanks for the suggestion.

After checking gettext-tools to confirm that .mo files are encoded in
UTF-8, I will try to implement the algorithm.

-- 
Masanori Ogino
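
The four-step search order quoted above could be sketched in C roughly as follows. This is only an illustration under stated assumptions, not the code that went into musl: locale_candidates and LOC_MAX are hypothetical names, the input is assumed to have the shape ll[_TT][.codeset][@mod], and the parts are not validated. The function produces the names to try, most specific first, and leaves opening the files to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define LOC_MAX 64 /* assumed cap on locale-name length, for this sketch only */

/*
 * Hypothetical helper (not a musl function): fill out[] with the names
 * to try, most specific first, and return how many there are (0 if the
 * name does not fit in the buffers).
 */
int locale_candidates(const char *name, char out[4][LOC_MAX])
{
    assert(name != NULL);
    size_t len = strlen(name);
    if (len >= LOC_MAX)
        return 0;

    /* Split "ll[_TT][.codeset][@mod]" without validating each part. */
    const char *mod = strchr(name, '@');          /* "@mod" or NULL */
    const char *end = mod ? mod : name + len;     /* end of ll[_TT][.codeset] */
    const char *cs = memchr(name, '.', (size_t)(end - name));
    const char *lt_end = cs ? cs : end;           /* end of ll[_TT] */
    const char *terr = memchr(name, '_', (size_t)(lt_end - name));

    int lang_len = (int)((terr ? terr : lt_end) - name);
    int terr_len = terr ? (int)(lt_end - terr) : 0;
    const char *t = terr ? terr : "";
    const char *m = mod ? mod : "";
    int n = 0;

    /* 1. Full name, with any codeset identifier dropped. */
    snprintf(out[n++], LOC_MAX, "%.*s%.*s%s", lang_len, name, terr_len, t, m);
    /* 2. Modifier removed. */
    if (mod)
        snprintf(out[n++], LOC_MAX, "%.*s%.*s", lang_len, name, terr_len, t);
    /* 3. Territory removed. */
    if (terr)
        snprintf(out[n++], LOC_MAX, "%.*s%s", lang_len, name, m);
    /* 4. Both removed. */
    if (mod && terr)
        snprintf(out[n++], LOC_MAX, "%.*s", lang_len, name);
    return n;
}
```

For an input of the form ll_TT.UTF-8@mod this yields ll_TT@mod, ll_TT, ll@mod, and ll, in that order; en_US.UTF-8 yields en_US and en; de yields only de. Following the remark about the lookup key, a caller would presumably keep the original full locale name in the cache key and use these candidates only when opening files.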