Date: Thu, 31 Dec 2015 10:37:58 +0100
From: Markus Wichmann <nullplan@....net>
To: musl@...ts.openwall.com
Subject: Re: Patches: Timezone in %c and POT file

On Wed, Dec 30, 2015 at 10:58:48AM -0500, Rich Felker wrote:
> On Wed, Dec 30, 2015 at 11:56:33AM +0100, Markus Wichmann wrote:
> > Hi all,
> > 
> > Now I have subscribed, so CC'ing me is no longer necessary.
> > 
> > Today I worked on two things: Firstly, I put the timezone into
> > strftime's %c output. The reason is that glibc's strftime() does the
> > same. That means that an application dev currently can't depend on
> > either behavior (so strftime("%c %Z") will give me the timezone twice on
> > glibc, but only once on musl, and my app won't be able to tell without
> > inspecting the resulting string).
> > 
> > No biggie, changing that one is easy. Of course, a heated argument can
> > be had over whether or not we want it one way or the other. And it'll
> > come down to personal taste, because as far as I'm aware, POSIX isn't
> > mandating anything about this.
> 
> The format for %c in the C locale is strictly specified by ISO C as
> "%a %b %e %T %Y"; see 7.27.3.5 ¶ 7. If glibc does not match this it's
> a bug in glibc. POSIX is of course aligned with ISO C and says the
> same thing. In other locales it's permitted to differ.
> 

Ah, sorry, I didn't check the C locale. And I didn't look up strftime()
in the C standard. I did look up D_T_FMT in POSIX, though. Ah well, it
happens.
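
For anyone who wants to check a given libc against the format Rich
quotes, here is a minimal sketch. The function name is made up for
illustration. It assumes the LC_TIME category can be switched to "C",
which changes global state, so it is only suitable for a small test
program. On a conforming implementation it should return 1.

```c
#include <locale.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: format tm with "%c" and with the explicit format
 * ISO C prescribes for the C locale, then compare the two results.
 * Returns 1 if they match, 0 if they differ, -1 if formatting failed. */
int c_format_matches_iso(const struct tm *tm)
{
    char a[128], b[128];

    /* Force the C locale for time formatting (global side effect). */
    setlocale(LC_TIME, "C");

    if (!strftime(a, sizeof a, "%c", tm)) return -1;
    if (!strftime(b, sizeof b, "%a %b %e %T %Y", tm)) return -1;
    return strcmp(a, b) == 0;
}
```

In other locales the two strings are allowed to differ, so the check
says nothing about what %c does there.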

Attached are two new patches, the first reverting this one, the second
updating the POT and the PO to reflect that change.

> > Then I noticed that for quite some time now, musl has been supporting
> > .mo files, but no infrastructure is in place for them (i.e. no POT file
> > nor any PO file is shipped). I tried searching around for a POT or PO
> > file, but I couldn't find any. So I added a handwritten POT file and a
> > German PO file (I'm not proficient enough in any languages besides
> > English and German to want to create that file for any other languages.
> > And an English PO file would be kind of redundant.)
> > 
> > I filled the POT file with all the strings I could find that would ever
> > be plugged into __lctrans(). That gives me strerror(), strsignal(),
> > gai_strerror(), hstrerror(), and __getopt_msg() strings.
> 
> Have you read the thread "Call for locales maintainer & contributors"
> from when locale support launched? Here's a link to the start of it:
> 
> http://www.openwall.com/lists/musl/2014/07/24/14
> 

No, I haven't. I'm looking into it now. I did search for a musl locale
repository and couldn't find any, so that's why I sent these patches.

> It might have some useful ideas. The main one I'd like to point out is
> the idea to develop and maintain locales as a separate repo outside of
> the source repo. Unlike glibc, we don't have a lot of messages that
> should be expected to change frequently, so I think the issues with
> keeping sync are minimal, while there are several advantages:
> 
> - Not having translation progress stalled-by/tied-to code release
>   cycles.
> 
> - Saving users who don't want locales from having to download them.
> 
> - No need to have locale patches go through me.
> 

Since we're going with gettext's MO files, the separate repo might work,
but I think you should at least keep an up-to-date POT file in the musl
repo. That way the locale repo just has to check whether the POT file is
still up to date, and if not, what changed, to be able to update all the
language POs.

Also, keeping at least a POT file around allows people interested in
translation work to get into it way more quickly, even without the need
for a repo. After the POT file was done, the German PO was a matter of
half an hour at the most, and incidentally, German is the only translation I
was interested in. Since locale is still very much a DIY thing, that
wouldn't be so bad.

But then we probably should annotate the weirder entries (the
nl_langinfo() stuff).

> > Unfortunately this design is running into some problems: At the moment
> > several strings are empty in the C locale (which is fine), but they
> > could translate to something else in some other locale (nl_langinfo()'s
> > ERA* and THOUSEP come to mind).
> 
> Yes. Those are unsupported right now, along with a lot of related
> functionality. There's also no way to set the fields of localeconv(),
> which come mostly (entirely, I think) from LC_MONETARY and LC_NUMERIC.
> Depending on how we end up representing that data in the locale file,
> it might make sense to use some sort of preprocessing script to
> generate this part of the .po file, but I'd like to have the format
> just be simple and natural to do in .po if that doesn't impose heavy
> code or runtime overhead.
> 
> > Some strings in the C locale are the
> > exact same and might translate to something else in some language (the
> > long and short forms of "May" for instance). I think glibc solves that
> > problem with another file format for libc's locales, which is a headache
> > I don't want to think about this year anymore.
> 
> "May" is a good example. Yes, I've never much liked the gettext model
> of keying by untranslated/English string, but for translation it's the
> only one that's translator-friendly, and for musl it was the only
> choice that saved us from having to develop a new file format and code
> to handle it.
> 
> The easiest solution I've come up with is prefixing and doing
> something like __lctrans("<prefix>string")+prefixlen, ideally with a
> prefixlen of 1, e.g. __lctrans("\5May")+1. This would just add 1 byte
> to each string in the built-in C locale data and one inc/add
> instruction, not a significant cost. Do you have any other good ideas?
> 

Well, the prefix idea is good, but it requires changes to all the
interfaces calling __lctrans() and to the POT and PO files. And all of that
for a potential problem we don't even have right now. (BTW: Despite all
my searching, I couldn't find out what POSIX means by "era".)
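
To make the prefix idea concrete, here is a minimal sketch. It does not
use musl's real __lctrans(); lctrans_sketch() and the tiny catalog are
hypothetical stand-ins, and the prefix bytes \5 and \6 as well as the
Spanish strings are only illustrative. The point is that the key and
the translation both carry a one-byte prefix, so skipping one byte
works whether or not a translation was found.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a loaded catalog. Keys carry a one-byte
 * context prefix so identical English strings stay distinct. */
struct entry { const char *msgid; const char *msgstr; };

static const struct entry catalog[] = {
    { "\5May", "\5may"  },  /* abbreviated month name (illustrative) */
    { "\6May", "\6mayo" },  /* full month name (illustrative) */
};

/* Hypothetical lookup: return the translation, or the key itself when
 * no translation exists, as a gettext-style lookup would. */
const char *lctrans_sketch(const char *msg)
{
    for (size_t i = 0; i < sizeof catalog / sizeof catalog[0]; i++)
        if (!strcmp(catalog[i].msgid, msg))
            return catalog[i].msgstr;
    return msg;
}

/* Callers skip the prefix byte, as in __lctrans("\5May")+1. */
const char *abbrev_may(void) { return lctrans_sketch("\5May") + 1; }
const char *full_may(void)   { return lctrans_sketch("\6May") + 1; }
```

An untranslated key such as "\5Jun" falls through to the key itself,
and the +1 still yields "Jun".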

Alternatively we could go the other route and map those string lists
into memory from locale files. I'm thinking of a file that contains a
string like c_time or c_messages, or better yet: All of them. Then we
prefix that file with offsets to where the strings start... Loading
would just be mmap() and setting a couple pointers, and using it would
be just loading str from a different source in nl_langinfo()...

Oh no, that would be the "custom file format" route. Viable, but also a
lot of work. To create those files we'd need tools and before long we'd
be reinventing MO files.
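
For illustration only, the offset-table layout described above could
look roughly like the sketch below. The layout, blob_pack() and
blob_string() are all hypothetical and not an existing musl format.
Integers are in host byte order to keep it short, and the buffer is
plain memory here; a real loader would obtain it from mmap().

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout:
 *   uint32_t count;
 *   uint32_t offset[count];  byte offsets from the start of the blob
 *   char     strings[];      NUL-terminated strings */

/* Return string number idx, or NULL if idx or the blob is invalid. */
const char *blob_string(const void *blob, size_t len, uint32_t idx)
{
    const unsigned char *p = blob;
    uint32_t count, off;

    if (len < sizeof count) return NULL;
    memcpy(&count, p, sizeof count);
    if (idx >= count) return NULL;
    /* The whole offset table must fit inside the blob. */
    if ((len - sizeof count) / sizeof off < count) return NULL;
    memcpy(&off, p + sizeof count + (size_t)idx * sizeof off, sizeof off);
    if (off >= len) return NULL;
    /* The string must be terminated before the end of the blob. */
    if (!memchr(p + off, 0, len - off)) return NULL;
    return (const char *)p + off;
}

/* Pack n strings into out. Returns bytes used, or 0 if out is too small. */
size_t blob_pack(void *out, size_t cap, const char *const *strs, uint32_t n)
{
    unsigned char *p = out;
    size_t pos = sizeof n + (size_t)n * sizeof(uint32_t);

    if (pos > cap) return 0;
    memcpy(p, &n, sizeof n);
    for (uint32_t i = 0; i < n; i++) {
        size_t l = strlen(strs[i]) + 1;
        uint32_t off = (uint32_t)pos;

        if (l > cap - pos) return 0;
        memcpy(p + sizeof n + (size_t)i * sizeof off, &off, sizeof off);
        memcpy(p + pos, strs[i], l);
        pos += l;
    }
    return pos;
}
```

Even this toy version needs a packer tool and bounds checks on load,
which is the extra work being weighed against MO files.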


No, as it stands, I'd go with "Let's cross that bridge when we come to
it". musl doesn't support a lot of things that would be necessary for
full locale support (and I don't particularly want it to. Sure, it's
annoying to have to cut long numbers up into groups of three digits
manually, but the code to support it in the POSIX way would just be
insane).
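
For reference, the manual grouping mentioned above is short when the
application picks the separator itself. This is a minimal sketch with a
made-up helper name. It always groups by three and ignores the
locale-defined grouping rules, which is exactly the part that would be
complicated to support.

```c
#include <stdio.h>

/* Hypothetical helper: write v into out with sep between groups of
 * three digits. Returns the length written, or -1 if out is too small. */
int group_digits(unsigned long long v, char sep, char *out, size_t cap)
{
    char digits[32];
    int len = snprintf(digits, sizeof digits, "%llu", v);
    size_t j = 0;

    if (len < 0) return -1;
    /* digits + separators + terminating NUL must fit */
    if ((size_t)len + (size_t)(len - 1) / 3 + 1 > cap) return -1;

    for (int i = 0; i < len; i++) {
        /* Insert a separator when a multiple of three digits remains. */
        if (i && (len - i) % 3 == 0) out[j++] = sep;
        out[j++] = digits[i];
    }
    out[j] = 0;
    return (int)j;
}
```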

> > [...]
> > +
> > +msgid "."
> > +msgstr ","
> 
> musl explicitly does not support changing the radix point; there's an
> old thread on this topic I can dig up if you'd like to read it. It
> looks to me like nl_langinfo(RADIXCHAR) will return a replacement if
> the locale file defines one, but then you get inconsistent results
> since it won't be used (e.g. by printf or strtod). Probably
> nl_langinfo should avoid passing the "." to __lctrans at all so that
> this inconsistency can't arise. This would also allow us to support
> "mon_decimal_point" (which would otherwise be a duplicate untranslated
> string) if desired, I think.
> 

Oh yes, I remember reading that.

Well, we could remove that string from the PO file and the POT file. I
haven't done that yet in the attached patches, but you can do it after
applying them (then I don't get a merge conflict). But yeah, printing
numbers in the local way is also something even glibc tries to avoid.
Also, I doubt it is necessary.

Since few libcs offer local number support, programs that try to offer
local number I/O have to work around that, anyway (changing commas to
dots before passing it to strtod(), filtering out thousands separators,
etc.) so few if any applications need that. I only put it there for
completeness' sake.
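
The kind of workaround meant here can be sketched as follows. The
function name and its parameters are made up for illustration. The
sketch assumes strtod() itself expects "." as the radix character,
which holds on musl and in the C locale elsewhere.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: parse a number written with application-chosen
 * separators. Thousands separators are dropped and the radix character
 * is replaced by '.' before the string goes to strtod().
 * Returns 0 on success, -1 on malformed or overlong input. */
int parse_local_number(const char *s, char radix, char thousands,
                       double *out)
{
    char buf[64];
    size_t n = 0;
    char *end;
    double v;

    for (; *s; s++) {
        if (thousands && *s == thousands) continue;  /* filter out */
        if (n + 1 >= sizeof buf) return -1;
        buf[n++] = (*s == radix) ? '.' : *s;
    }
    buf[n] = 0;

    v = strtod(buf, &end);
    if (end == buf || *end) return -1;  /* nothing parsed or junk left */
    *out = v;
    return 0;
}
```

For German-style input one would call it with radix ',' and thousands
'.'. It does not verify that the thousands separators sit at sensible
positions.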

Ciao,
Markus

View attachment "0001-Revert-Add-timezone-to-strftime-s-c.patch" of type "text/x-diff" (777 bytes)

View attachment "0002-Necessary-changes-for-correct-C-locale.patch" of type "text/x-diff" (1188 bytes)
