Message-ID: <aE5tLTapfxLhG_Xj@voyager>
Date: Sun, 15 Jun 2025 08:50:21 +0200
From: Markus Wichmann <nullplan@....net>
To: musl@...ts.openwall.com
Subject: Re: On current (and future) use of LCTRANS

On Sun, Jun 15, 2025 at 12:38:46AM +0200, Pablo Correa Gomez wrote:
> Unfortunately, there are quite some many functions that currently
> ignore locales when being passed to them:
> 

I am wondering why you care so much about the _l variants of these
functions. IMO, the only sensible implementation of these is:

1. if the base function is locale-independent, ignore the locale
   argument.
2. otherwise, wrap the base function in uselocale() calls.

The main functionality should stay in the base function. The second
rule also shows how to avoid the need for any more _l functions in the
future: a caller can do the same uselocale() wrapping itself.
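
As a rough sketch of rule 2, assuming strcoll() as the base function:
the name sketch_strcoll_l is made up for illustration, and this is not
musl's actual code.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <locale.h>
#include <string.h>

/* Hypothetical wrapper: all real work stays in the base function. */
int sketch_strcoll_l(const char *a, const char *b, locale_t loc)
{
    assert(loc != (locale_t)0);

    /* Switch the calling thread's locale and remember the previous one. */
    locale_t old = uselocale(loc);

    int r = strcoll(a, b);

    /* Restore whatever the caller had installed before. */
    uselocale(old);
    return r;
}
```

An application that needs a locale-aware variant of some other function
can wrap the base function in the same way, without a new _l entry
point in libc.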

> * is{w}alnum_l
> * is{w}alpha_l
> * is{w}blank_l
> * is{w}cntrl_l
> * is{w}digit_l
> * is{w}graph_l
> * is{w}lower_l
> * is{w}print_l
> * is{w}punct_l
> * is{w}space_l
> * is{w}upper_l
> * is{w}xdigit_l
> * iswctype_l
> * strfmon_l
> * to{w}lower_l
> * to{w}upper_l
> * towctrans_l
> 

Isn't that by design? Except for strfmon, these are all the ctype and
wctype functions. Those should remain locale independent, shouldn't
they? musl only supports two runtime codesets, namely ASCII and UTF-8,
and only one wide character codeset, namely Unicode. (And ASCII only on
sufferance, because POSIX decided to require MB_CUR_MAX in the POSIX
locale to be 1).

So the ctype functions can only ever be their ASCII versions, because
the ASCII bytes are the only possible single-byte characters (in ASCII
mode, the high bytes aren't characters, and in UTF-8 mode, the high
bytes aren't complete characters), and the wctype functions can only
ever be their Unicode versions, and no locale can ever change that.
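
To illustrate what "ASCII version" means here, a minimal sketch of an
ASCII-only alphabetic test. The names are invented for this example; it
only shows that no locale input is needed, since every byte >= 0x80 is
rejected regardless of codeset.

```c
#include <assert.h>

/* Hypothetical ASCII-only isalpha: set bit 5 to fold upper case onto
   lower case, then range-check against 'a'..'z' with one unsigned
   comparison. Values below 'a' wrap around and fail the check. */
int sketch_isalpha(int c)
{
    return ((unsigned)c | 32) - 'a' < 26;
}

/* Spot checks: letters pass; digits, high bytes, and EOF do not. */
void sketch_isalpha_examples(void)
{
    assert(sketch_isalpha('a') && sketch_isalpha('Z'));
    assert(!sketch_isalpha('1'));
    assert(!sketch_isalpha(0xE9)); /* Latin-1 e-acute, or a UTF-8 lead byte */
    assert(!sketch_isalpha(-1));   /* -1, the usual value of EOF */
}
```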

Support for other character sets is relegated to the iconv() API, which
can convert anything else into the only sensible choice, UTF-8.
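
For example, converting ISO-8859-1 input to UTF-8 with iconv() could
look roughly like this. The helper name latin1_to_utf8 and its
buffer-handling conventions are assumptions made for the sketch.

```c
#include <assert.h>
#include <iconv.h>
#include <string.h>

/* Hypothetical helper: convert a NUL-terminated ISO-8859-1 string to
   UTF-8. Returns the number of bytes written (excluding the NUL), or
   (size_t)-1 on failure, including when the output buffer is too small. */
size_t latin1_to_utf8(const char *in, char *out, size_t outsize)
{
    assert(outsize > 0);

    iconv_t cd = iconv_open("UTF-8", "ISO-8859-1"); /* (tocode, fromcode) */
    if (cd == (iconv_t)-1) return (size_t)-1;

    char *src = (char *)in;       /* iconv() wants char **, input is not modified */
    size_t inleft = strlen(in);
    char *dst = out;
    size_t outleft = outsize - 1; /* keep room for the terminator */

    size_t r = iconv(cd, &src, &inleft, &dst, &outleft);
    iconv_close(cd);
    if (r == (size_t)-1) return (size_t)-1;

    *dst = '\0';
    return (size_t)(dst - out);
}
```

After the conversion, the program only ever deals with UTF-8, so the
ctype and wctype functions do not need to know about the original
character set.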

Also note that
    - is{,w}digit can never be changed as per POSIX
    - is{,w}xdigit can only be changed for the alphabetic characters,
      but then the number parsing functions have to be changed to be
      consistent.

> We might be able to go without using locales in some of them (like
> isdigit), but we certainly cannot with others that currently use ASCII
> codes where letters in other alphabets don't fit.
> 
> In addition, we have some functions related to collation where this is
> also ignored:
> 

Well, this is because for now, musl has only been using codepoint
collation in all locales. But that is what Rich is currently working on,
isn't it? I don't know how locale-independent the result will be.

> * {wcs,wcsn,str,strn}casecmp_l
> * {wcs,str}coll_l
> * {wcs,str}xfrm_l
> * wctrans_l
> * wctype_l
> 

strcasecmp() is a bad API that is underspecified for multibyte codesets.
Its specification allows, but does not require, an implementation to
perform the case mapping in wide-character space, and to deal with
encoding errors by returning an error, without specifying how to do so.
For this reason, applications cannot expect any sensible behaviour out
of it as soon as the input strays from ASCII, and so any locale
dependency is just moot.

Also, the casecmp and wctrans functions deal with case mapping, which is
different from collation, isn't it? And case mapping is very definitely
locale-independent, except possibly for the multibyte↔widechar conversion.
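
A sketch of the only behaviour an application can portably rely on: a
byte-wise comparison that folds ASCII letters and treats every other
byte as opaque. The function names are invented here, and this is one
possible implementation permitted by the specification, not a
description of any particular libc.

```c
#include <assert.h>

/* Fold 'A'..'Z' to lower case; leave every other value unchanged. */
static int ascii_tolower(int c)
{
    return (unsigned)c - 'A' < 26 ? c | 32 : c;
}

/* Hypothetical ASCII-only strcasecmp: compares unsigned bytes after
   ASCII case folding, with no locale involved. */
int sketch_strcasecmp(const char *a, const char *b)
{
    const unsigned char *l = (const unsigned char *)a;
    const unsigned char *r = (const unsigned char *)b;

    while (*l && ascii_tolower(*l) == ascii_tolower(*r)) {
        l++;
        r++;
    }
    return ascii_tolower(*l) - ascii_tolower(*r);
}

/* Spot checks: ASCII case is ignored, non-ASCII case is not. */
void sketch_strcasecmp_examples(void)
{
    assert(sketch_strcasecmp("Hello", "hELLO") == 0);
    /* UTF-8 for upper-case and lower-case e-acute compare as different. */
    assert(sketch_strcasecmp("\xC3\x89", "\xC3\xA9") != 0);
}
```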

> In addition to this, we have the RADIXCHAR, which we hard-code in many
> places while doing transformations. Finding the exact places where it
> has to be implemented might be more tricky, but a non-exhaustive list:
> 
> * vstrfmon_l (internal used by strfmon family)
> * fmt_ft (internal used by printf family)
> * dec_float,hex_float (internal used by floatscan family)
> 

Rich has in the past expressed his wish to at most allow one other radix
character, namely the comma, and I support that. Having RADIXCHAR be
freely definable makes things way overcomplicated for no real gain.
Imagine someone setting RADIXCHAR to '\r'.
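
One way such a restriction could be expressed, purely as a sketch: the
function names and the idea of passing the locale's requested radix as
a string are assumptions for illustration, not existing musl
interfaces.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical policy: accept "," from a locale definition, and map
   everything else (including nonsense such as "\r") to '.'. */
char sketch_radix_char(const char *requested)
{
    if (requested && strcmp(requested, ",") == 0) return ',';
    return '.';
}

/* Replace the first '.' in an already formatted number with the
   selected radix character. Only the two permitted values are allowed. */
void sketch_apply_radix(char *num, char radix)
{
    assert(radix == '.' || radix == ',');
    char *dot = strchr(num, '.');
    if (dot) *dot = radix;
}
```

With only two possible values, the parsing side stays equally simple:
it has to recognise either '.' or ',' and nothing else.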

And I think those are all the places where a floating-point value is
converted to a string or vice versa.

Ciao,
Markus
