Date: Mon, 5 Mar 2018 12:10:27 -0500
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Re: Bikeshed invitation for nl_langinfo ambiguities

On Sat, Mar 03, 2018 at 12:08:54AM -0500, Rich Felker wrote:
> On Sun, Nov 26, 2017 at 05:19:07PM -0600, A. Wilcox wrote:
> > -----BEGIN PGP SIGNED MESSAGE-----
> > Hash: SHA256
> > 
> > On 10/11/17 20:06, Rich Felker wrote:
> > > I've found 2 ambiguous-string-to-translate bugs in musl's locale 
> > > support in nl_langinfo: The pairs ABMON_5 and MON_5 ("May"), and
> > > T_FMT and ERA_T_FMT ("%H:%M:%S"), have the same values in the C
> > > locale, and thus can't be translated to distinct values like they
> > > need to be in other locales.
> > > 
> > > Any opinions on the cleanest way to handle this? There are various 
> > > hacks I could do at the implementation level, like adding a prefix 
> > > character to one or the other then applying +1 to the output
> > > string, but whatever solution we choose becomes a public interface
> > > for translators, so it should be something that's not horribly
> > > ugly.
> > 
> > I would personally recommend actually using the enum values as the
> > strings to translate.  _("MON_5"), _("ABMON_5"), etc; this is
> > non-ambiguous, easily understandable and describable for translators,
> > and does not require weird hacks at the implementation or ABI level.
> 
> I think this may be the nicest approach, despite being an incompatible
> change from the existing system, which apparently doesn't matter and
> isn't being used or people would have noticed that "May" can't be
> translated right.

One really ugly thing here is that the POSIX key for weekdays is
"highly unconventional" - ABDAY_1/DAY_1 is Sunday and ABDAY_7/DAY_7 is
Saturday. Even the Unicode CLDR noticed this nonsense and used
"sun"..."sat" as the keys rather than using numbers so as to be
unambiguous.
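
For illustration, a minimal sketch of what that numbering means in
practice. The helper name weekday_name is hypothetical; the only real
interfaces used are nl_langinfo and the DAY_* items. The explicit table
avoids assuming the DAY_* constants are consecutive. Index 0 is Sunday,
which happens to match tm_wday:

```c
#include <langinfo.h>

/* Hypothetical helper: map a 0-based weekday (0 = Sunday, as in
 * struct tm's tm_wday) to the POSIX langinfo item. DAY_1 is Sunday
 * and DAY_7 is Saturday, regardless of locale conventions. */
static const char *weekday_name(int wday)
{
	static const nl_item days[7] = {
		DAY_1, DAY_2, DAY_3, DAY_4, DAY_5, DAY_6, DAY_7
	};
	if (wday < 0 || wday > 6) return "";
	return nl_langinfo(days[wday]);
}
```

In the C locale, weekday_name(0) yields "Sunday", not "Monday".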

> > Of course, then a "C" / "POSIX" strings file must be present.  But
> > this is, in my opinion, a very small sacrifice to ensure full purity
> > and ease of translation.
> 
> As noted before, obviously this isn't acceptable. We could drop a .mo
> file blob in the musl langinfo.c, but I think it might make more sense
> to just use different code paths for translated vs nontranslated case.

I did some simple estimates with a toy .po/.mo file, and it looks like
either of those approaches is going to more-than-double the size of
langinfo.o, and make it a lot more complex. Given that "Sun".."Sat"
are nicer keys for days anyway, I'm leaning back towards sticking with
what we have and just adding a special case for "May". The other
ambiguity is one of the ERA_* formats, which we're not even doing
right now anyway; they're "not available in the POSIX locale"
according to XBD 7.3.5 LC_TIME, so as I read it they should return ""
(not the corresponding non-era string) in the C/POSIX locale, and only
return something else if they're defined for the locale. Eventually,
we should probably look them up with mo keys like "era_d_fmt", etc.
but unless/until we properly support them, the lookups for them should
just be removed.
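
To make the "May" problem concrete, here is a toy sketch of a lookup
keyed by the C-locale string. It is not musl's LCTRANS; toy_catalog and
toy_translate are made up for illustration, and the Spanish strings are
only example data:

```c
#include <langinfo.h>
#include <string.h>

/* Hypothetical catalog keyed by the C-locale string (the msgid). */
struct toy_entry { const char *msgid, *msgstr; };

static const struct toy_entry toy_catalog[] = {
	{ "May", "mayo" },	/* meant for MON_5 */
	{ "May", "may" },	/* meant for ABMON_5, but unreachable */
};

/* Return the first msgstr whose msgid matches, or msgid itself if
 * there is no translation. */
static const char *toy_translate(const char *msgid)
{
	size_t i;
	for (i = 0; i < sizeof toy_catalog / sizeof *toy_catalog; i++)
		if (!strcmp(toy_catalog[i].msgid, msgid))
			return toy_catalog[i].msgstr;
	return msgid;
}
```

Since nl_langinfo(MON_5) and nl_langinfo(ABMON_5) both give "May" in
the C locale, both go through the same key and come back as "mayo";
the second entry can never be selected by msgid alone.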

> Then we could just synthesize the keys (ABMON_*, MON_*, ABDAY_*,
> DAY_*) to pass into LCTRANS() rather than having a table of them all
> expanded out. I might change my mind when actually working out how the
> code would look, though.

I started working on a nice means of doing this synthesis - having a
table like the existing c_time etc. but contents like:

	"ABDAY_1\0\0\0\0\0\0\0"
	"DAY_1\0\0\0\0\0\0\0"
	"ABMON_1\0\0\0\0\0\0\0\0\0\0\0\0"
	"MON_1\0\0\0\0\0\0\0\0\0\0\0\0"
	...

where, when a zero-length entry is hit, the last non-zero-length one
seen gets used as a basis for synthesis. But it still didn't seem
possible to avoid significant increase in code size and complexity.
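
Roughly, the synthesis over such a table could look like the sketch
below. This is only an illustration of the idea, not the code I had;
the names keys and synth_key are hypothetical, and it assumes every
non-empty entry ends in the single digit "1":

```c
#include <stdio.h>
#include <string.h>

/* NUL-separated entries; an empty entry means "same stem as the last
 * non-empty entry, with the next number". */
static const char keys[] =
	"ABDAY_1\0\0\0\0\0\0\0"
	"DAY_1\0\0\0\0\0\0\0"
	"ABMON_1\0\0\0\0\0\0\0\0\0\0\0\0"
	"MON_1\0\0\0\0\0\0\0\0\0\0\0\0";

/* Write the key for 0-based entry idx into buf. Returns buf, or a
 * null pointer if idx is past the end of the table. */
static const char *synth_key(size_t idx, char *buf, size_t len)
{
	const char *p = keys, *end = keys + sizeof keys - 1;
	const char *base = 0;
	size_t i, dist = 0;

	for (i = 0; p < end; i++) {
		if (*p) { base = p; dist = 0; }	/* new stem */
		else dist++;			/* one further from the stem */
		if (i == idx) break;
		p += strlen(p) + 1;
	}
	if (p >= end || !base) return 0;

	/* Drop the trailing "1" of the stem and append the real number. */
	snprintf(buf, len, "%.*s%zu", (int)(strlen(base) - 1), base, dist + 1);
	return buf;
}
```

With this layout, entry 6 comes out as "ABDAY_7", entry 7 as "DAY_1",
and entry 30 as "MON_5". Even in this reduced form it needs a walk over
the table plus number formatting, which is where the extra size and
complexity come from.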

Rich
