Date: Wed, 11 Sep 2019 09:47:27 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Re: printf doesn't respect locale

On Wed, Sep 11, 2019 at 02:53:36PM +0200, Jens Gustedt wrote:
> Hello Rich,
> 
> On Wed, 11 Sep 2019 07:44:37 -0400 Rich Felker <dalias@...c.org> wrote:
> 
> > On Wed, Sep 11, 2019 at 12:07:22PM +0200, Jens Gustedt wrote:
> 
> > > I think that WG14 would be happy to hear any suggestions how we
> > > could get out of this trap, a proposal for C2x would even be
> > > better.  
> > 
> > The obvious solution is a modifier character to printf/scanf format
> > strings that applies to numeric conversions and means "always
> > format/interpret this as if in the C locale". However this is hard to
> > test for at build time unless there's a macro declaring its
> > availability, so ideally WG14 would also adopt the sort of
> > fine-grained feature availability macros some of us have been
> > proposing for extensions.
> 
> If such a proposal would be made, it would have to be based on a
> reference implementation in the field. Would musl be willing to be
> such a reference implementation?

Possibly, contingent on some willingness of other parties to be on
board with it (even if not implementing it at first). I don't want
musl to be in the position of implementing something new that's not
standardized and likely to *conflict* with future standards, which
custom format flags could do.

> In addition, I would think that it should not switch off all locale
> features but should leave the encoding properties such as UTF-8
> functional.

Absolutely, but encoding is not relevant to numeric fields. Everything
else is strictly specified, at least for formatting (printf). For
conversion (scanf) implementation-defined locale-specific forms are
also allowed, but this is probably not wanted when you're processing
data from a serialized form that's intended to be universal.

> > An alternative/additional solution, which I actually might like
> > better, is having a function which sets a thread-local flag to treat
> > certain locale properties (at least the problematic LC_NUMERIC ones)
> > as if the current locale were "C". This is weaker than the uselocale
> > API from POSIX, but doesn't have the problems with the possibility of
> > failure (likely with no way to make forward progress) like it does,
> > and more importantly, would avoid *breaking* m17n/i18n functionality
> > by turning off other unrelated, non-problematic locale features.
> > Application or library code could then just set/restore this flag
> > around *printf/*scanf/strto*/etc calls, or could set it and leave it
> > if they never want to see ',' again.
> 
> Interesting.
> 
> Would this be difficult to implement in musl? (I guess not)

I would think not, but I'd have to look at the details a little more.

One other advantage of this approach is that it has a more graceful
fallback. If an application needs portable LC_NUMERIC behavior, it can
check at build time for the presence of the new interface. If present,
LC_NUMERIC can be set to "" (user's preference) and the new interface
can be used to get the needed behavior. If absent, the application can
refrain from setting LC_NUMERIC, only setting the other categories and
leaving it as "C" (default).
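
A minimal sketch of the fallback described above. HAVE_NUMERIC_C_OVERRIDE
is a hypothetical build-time macro standing in for "the new interface is
present"; no such macro or interface exists in musl or in any standard
today. Only the setlocale() calls are real API.

```c
#include <locale.h>
#include <stdio.h>

/* HAVE_NUMERIC_C_OVERRIDE is a hypothetical macro, assumed to be set by
   a build-time check for the proposed thread-local override. */
static void init_locale(void)
{
#ifdef HAVE_NUMERIC_C_OVERRIDE
    /* Interface present: honor the user's preference everywhere,
       including LC_NUMERIC. Serialization code would then enable the
       (hypothetical) override around its printf/scanf/strto* calls. */
    setlocale(LC_ALL, "");
#else
    /* Interface absent: set every category except LC_NUMERIC, which
       stays at its default "C" so the radix point remains '.'. */
    setlocale(LC_CTYPE, "");
    setlocale(LC_COLLATE, "");
    setlocale(LC_MONETARY, "");
    setlocale(LC_TIME, "");
#ifdef LC_MESSAGES
    setlocale(LC_MESSAGES, ""); /* POSIX category, not in ISO C */
#endif
#endif
}

/* Writes a value in a form meant for serialized data. In the fallback
   branch this always uses '.' because LC_NUMERIC was never changed. */
static int format_serialized(char *buf, size_t size, double value)
{
    return snprintf(buf, size, "%.2f", value);
}
```

In the fallback branch, the program gives up localized number display
entirely in exchange for predictable serialized output.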

Note that having it be thread-locally stateful is, in my opinion, much
better than having new variants of the affected functions or new
formats, since a caller using LC_NUMERIC can set/restore the state to
safely call library code that's completely unaware of the new
interfaces.
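
For comparison, this is roughly what the set/restore pattern looks like
today with the POSIX uselocale API mentioned earlier in the thread. It is
a sketch, not the proposed interface: legacy_format and
call_with_c_numeric are hypothetical names, and the point is the two
failure paths that a simple thread-local flag would not have.

```c
#define _POSIX_C_SOURCE 200809L
#include <locale.h>
#include <stdio.h>

/* Stand-in for library code that knows nothing about locale issues. */
static int legacy_format(char *buf, size_t size, double value)
{
    return snprintf(buf, size, "%.1f", value);
}

/* Calls legacy_format with LC_NUMERIC forced to "C" for this thread
   only. Returns -1 if the temporary locale cannot be created, in which
   case the caller has no good way to make forward progress. */
static int call_with_c_numeric(char *buf, size_t size, double value)
{
    /* Copy the thread's current locale (possibly the global one). */
    locale_t base = duplocale(uselocale((locale_t)0));
    if (base == (locale_t)0)
        return -1;

    /* Replace only the numeric category; on failure base is unchanged. */
    locale_t tmp = newlocale(LC_NUMERIC_MASK, "C", base);
    if (tmp == (locale_t)0) {
        freelocale(base);
        return -1;
    }

    locale_t old = uselocale(tmp);  /* set */
    int ret = legacy_format(buf, size, value);
    uselocale(old);                 /* restore */
    freelocale(tmp);
    return ret;
}
```

The other locale categories stay in effect during the call, which is the
property the thread-local flag is meant to preserve without the
allocation and its failure cases.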

Of course there may be complications I haven't thought of. One that
comes to mind right away is what localeconv() should return under such
conditions.
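
To illustrate why the localeconv() question matters: some existing code
works around LC_NUMERIC by asking localeconv() for the radix string and
rewriting it after formatting. The sketch below shows that workaround
(format_portable is a hypothetical helper name). If printf were forced
to use '.' while localeconv() still reported the locale's radix, or the
reverse, code like this would be searching for the wrong string.

```c
#include <locale.h>
#include <stdio.h>
#include <string.h>

/* Formats value with two decimals, then replaces whatever radix string
   localeconv() reports with '.'. Returns the resulting length, or -1 if
   formatting failed or the buffer was too small. */
static int format_portable(char *buf, size_t size, double value)
{
    int len = snprintf(buf, size, "%.2f", value);
    if (len < 0 || (size_t)len >= size)
        return -1;

    const char *radix = localeconv()->decimal_point;
    size_t rlen = strlen(radix);
    char *p = rlen > 0 ? strstr(buf, radix) : NULL;
    if (p) {
        /* The radix may be multibyte; collapse it to a single '.'. */
        *p = '.';
        memmove(p + 1, p + rlen, strlen(p + rlen) + 1);
    }
    return (int)strlen(buf);
}
```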

> Would you be willing to write this up?

What form would it need to be in?

> Once we have that in musl (even before having it in C2x), it could be
> easier to convince ourselves to have full locale support.

By "full" you mean variable radix point? I'm not sure it makes a big
difference in that it won't help code that's not prepared for radix
point to vary. What it does help is making it so code that is being
careful to avoid the breakage can still use LC_NUMERIC when it wants
to, without depending on POSIX.

Rich
