musl - Re: On current (and future) use of LCTRANS

Message-ID: <20250615214245.GV1827@brightrain.aerifal.cx>
Date: Sun, 15 Jun 2025 17:42:45 -0400
From: Rich Felker <dalias@...c.org>
To: Markus Wichmann <nullplan@....net>
Cc: musl@...ts.openwall.com,
	Pablo Correa Gomez <pabloyoyoista@...tmarketos.org>
Subject: Re: On current (and future) use of LCTRANS

On Sun, Jun 15, 2025 at 08:50:21AM +0200, Markus Wichmann wrote:
> On Sun, Jun 15, 2025 at 12:38:46AM +0200, Pablo Correa Gomez wrote:
> > Unfortunately, there are quite a few functions that currently
> > ignore the locale passed to them:
> > 
> 
> I am wondering why you care so much about the _l variants of these
> functions. IMO, the only sensible implementation of these is:
> 
> 1. if the base function is locale-independent, ignore the locale
>    argument.
> 2. otherwise, wrap the base function in uselocale() calls.

I'm not opposed to making the _l version of some functions primary if
it's cleaner to implement that way, especially if the non-_l version
is not sufficiently a hot-path that the extra call frame wrapping the
_l version is going to matter. For example we're already doing this
with strftime.

There are two main reasons it's the other way around (non-_l primary)
for most things now:

- When the locale_t argument is going to be ignored anyway (because
  the functionality doesn't vary by locale), the call from the _l
  version to the non-_l version can always be a tail call (simple
  jump), but the call from the non-_l version to the _l version would
  require a whole call frame setup on archs where args are passed in
  on the stack or in registers but with a requirement to reserve spill
  space on the stack.

- For functions in the C standard, the _l version is outside the
  reserved namespace, so we need to introduce a namespace-safe symbol
  to call the _l version by if the non-_l version is going to call the
  _l version.

But both of these are issues that can be dealt with if there's a good
reason to do one the other way.
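
The two wrapping directions discussed above can be sketched roughly as
follows. This is an illustration only, not musl's source: the demo_*
names are hypothetical, and the example assumes a POSIX.1-2008 system
that declares locale_t and uselocale() in <locale.h>.

```c
#define _POSIX_C_SOURCE 200809L
#include <locale.h>
#include <string.h>

/* Hypothetical locale-independent base function (ASCII letters only). */
int demo_isalpha(int c)
{
	return ((unsigned)c | 32) - 'a' < 26;
}

/* Case 1: the result does not vary by locale, so the _l variant
 * ignores its locale argument. The call is in tail position, so a
 * compiler may emit it as a plain jump. */
int demo_isalpha_l(int c, locale_t loc)
{
	(void)loc;
	return demo_isalpha(c);
}

/* Case 2: the base function is locale-dependent, so the _l variant
 * switches the thread's locale around the call and restores it. */
int demo_strcoll_l(const char *a, const char *b, locale_t loc)
{
	locale_t old = uselocale(loc);
	int r = strcoll(a, b);
	uselocale(old);
	return r;
}
```

Making the _l version primary would invert case 2: the non-_l function
would look up the current locale and pass it along, which for standard
C functions requires calling the _l version through a reserved-namespace
alias.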

> The main functionality should stay in the base function. And that second
> one then also shows how to avoid the need for any more _l functions in
> future.
> 
> > * is{w}alnum_l
> > * is{w}alpha_l
> > * is{w}blank_l
> > * is{w}cntrl_l
> > * is{w}digit_l
> > * is{w}graph_l
> > * is{w}lower_l
> > * is{w}print_l
> > * is{w}punct_l
> > * is{w}space_l
> > * is{w}upper_l
> > * is{w}xdigit_l
> > * iswctype_l
> > * strfmon_l
> > * to{w}lower_l
> > * to{w}upper_l
> > * towctrans_l
> > 
> 
> Isn't that by design? Except for strfmon, these are all the ctype and
> wctype functions. Those should remain locale independent, shouldn't
> they? musl only supports two runtime codesets, namely ASCII and UTF-8,
> and only one wide character codeset, namely Unicode. (And ASCII only on
> sufferance, because POSIX decided to require MB_CUR_MAX in the POSIX
> locale to be 1).
> 
> So the ctype functions can only ever be their ASCII versions, because
> the ASCII bytes are the only possible single-byte characters (in ASCII
> mode, the high bytes aren't characters, and in UTF-8 mode, the high
> bytes aren't complete characters), and the wctype functions can only
> ever be their Unicode versions, and no locale can ever change that.

Indeed, the reasons these don't do anything with the locale_t argument
and that they weren't included in the plan for the locale overhaul is
that they don't have anything locale-specific to do.

The mapping of the byte-based C locale, to map high bytes onto wchar_t
values that are not valid Unicode Scalar Values, was chosen
intentionally so that the isw*() functions do not classify them as
falling into any of the classes and the tow*() functions don't map
them. This was part of the design discussion when the byte-based C
locale was begrudgingly added.

Other than that, the encoding is always UTF-8, and character identity
or classification is not locale-specific.
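
A minimal sketch of the idea behind that mapping. The specific target
range used here (0xDF80 to 0xDFFF, inside the UTF-16 surrogate block)
and both function names are assumptions for illustration; the message
only says that the values are not valid Unicode Scalar Values.

```c
#include <stdint.h>

/* Assumed mapping for the byte-based C locale: ASCII bytes map to
 * themselves, high bytes map to 0xDF80..0xDFFF. */
uint32_t byte_to_wc_c_locale(unsigned char b)
{
	return b < 0x80 ? b : 0xDF00u | b;
}

/* A Unicode Scalar Value is any code point up to 0x10FFFF that is
 * not a surrogate (0xD800..0xDFFF). */
int is_unicode_scalar(uint32_t wc)
{
	return wc <= 0x10FFFF && !(wc >= 0xD800 && wc <= 0xDFFF);
}
```

Because every mapped high byte fails is_unicode_scalar(), a
classification table built purely from Unicode data has no entry for
it, which is the property described above.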

> > We might be able to go without using locales in some of them (like
> > isdigit), but we certainly cannot with others that currently use ASCII
> > codes where letters in other alphabets don't fit.
> > 
> > In addition, we have some functions related to collation where this is
> > also ignored:
> > 
> 
> Well, this is because for now, musl has only been using codepoint
> collation in all locales. But that is what Rich is currently working on,
> isn't it? I don't know how locale-independent the result will be.
> 
> > * {wcs,wcsn,str,strn}casecmp_l
> > * {wcs,str}coll_l
> > * {wcs,str}xfrm_l
> > * wctrans_l
> > * wctype_l
> 
> strcasecmp() is a bad API that is underspecified for multibyte codesets.
> Its specification allows, but does not require, an implementation to
> perform the case mapping in wide-character space, and to deal with
> encoding errors by returning an error, without specifying how to do so.
> For this reason, applications cannot expect any sensible behaviour out
> of it as soon as the input strays from ASCII, and so any locale
> dependency is just moot.
> 
> Also, the casecmp and wctrans functions deal with case mapping, which is
> different from collation, isn't it? And very definitely locale
> independent, except possibly for the multibyte↔widechar conversion.

strcasecmp is indeed underspecified, and I'm not aware of any systems
that implement it in a way that would do something useful. At least glibc
certainly does not. Last I checked, they implement it with tolower or
toupper, which was only meaningful in the pre-UTF-8 era.
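
To illustrate why a tolower-based implementation is only meaningful for
single-byte text, here is a sketch of a byte-wise case-insensitive
compare. The name bytewise_casecmp is hypothetical and this is not
claimed to be any particular libc's code.

```c
#include <ctype.h>

/* Compare two strings byte by byte, folding each byte with tolower().
 * Bytes that belong to a multibyte UTF-8 sequence are never folded
 * in a meaningful way. */
int bytewise_casecmp(const char *a, const char *b)
{
	const unsigned char *l = (const unsigned char *)a;
	const unsigned char *r = (const unsigned char *)b;
	for (; *l && *r && (*l == *r || tolower(*l) == tolower(*r)); l++, r++)
		;
	return tolower(*l) - tolower(*r);
}
```

With this approach "MUSL" and "musl" compare equal, but the UTF-8
encodings of U+00C9 and U+00E9 ("\xc3\x89" and "\xc3\xa9") do not,
assuming tolower() leaves bytes above 0x7F unchanged as it does in the
C locale.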

The wctype/wctrans functions, like the isw*/tow* functions above,
already work and are just not locale-specific for the same reason.

> > In addition to this, we have the RADIXCHAR, which we hard-code in many
> > places while doing transformations. Finding the exact places where it
> > has to be implemented might be more tricky, but a non-exhaustive list:
> > 
> > * vstrfmon_l (internal used by strfmon family)
> > * fmt_fp (internal used by printf family)
> > * dec_float,hex_float (internal used by floatscan family)
> > 
> 
> Rich has in the past expressed his wish to at most allow one other radix
> character, namely the comma, and I support that. Having RADIXCHAR be
> freely definable makes things way overcomplicated for no real gain.
> Imagine someone setting RADIXCHAR to '\r'.

Yes. The localedef system allowing arbitrary radixchar is dangerously
underspecified. '\r' isn't really special except for ugly and
misleading presentation, but things like setting it to '0' would be
much worse, breaking invariants about round-tripping and basically
breaking anything that would parse or format numbers entirely.
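
A small sketch of the '0' case. The helper format_with_radix is
hypothetical; it formats in the default C locale (it assumes
setlocale() has not been called) and then substitutes the radix
character by hand.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: format x with one fractional digit, then
 * replace the '.' with an arbitrary radix character.
 * Returns the string length, or -1 if the buffer is too small. */
int format_with_radix(char *buf, size_t n, double x, char radix)
{
	int len = snprintf(buf, n, "%.1f", x);
	if (len < 0 || (size_t)len >= n)
		return -1;
	char *dot = strchr(buf, '.');
	if (dot)
		*dot = radix;
	return len;
}
```

With radix '0', the value 1.5 is written as "105", and nothing in that
string tells a parser which '0' separates the integer part from the
fraction, so the output cannot be read back reliably.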

On top of that, it's rather meaningless to support arbitrary radix
characters unless you also allow that they be multibyte, and that
breaks all kinds of invariants callers might expect about how much
storage is needed for the output (in other words it potentially
exposes buffer overflow vulns in programs that are otherwise
mostly-correct except for a subtly wrong assumption about locales).

So I am largely against making the radixchar setting anything more
than a 1-bit field that's xor'd with bit 1 of the '.' character.
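
The 1-bit scheme works because '.' is 0x2E and ',' is 0x2C in ASCII, so
the two differ only in bit 1 (value 2). A minimal sketch, with the
hypothetical name radix_from_flag and assuming an ASCII-compatible
execution character set:

```c
/* comma == 0 selects '.', comma == 1 selects ','. Only the low bit
 * of the flag is used, so no other radix character is reachable. */
char radix_from_flag(unsigned comma)
{
	return (char)('.' ^ ((comma & 1u) << 1));
}
```

Since the result is always a single byte, callers' assumptions about
output size stay valid whichever radix is selected.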

> And I think those are all the places where float can be converted to a
> string or vice versa.

LC_MONETARY, used by strfmon, has a separate radix char vs LC_NUMERIC.

The only two places LC_NUMERIC radix char should be needed are fmt_fp
in src/stdio/vfprintf.c, and src/internal/floatscan.c for use by
strto{f,d,ld} and *scanf.

Rich