Message-ID: <aE5tLTapfxLhG_Xj@voyager>
Date: Sun, 15 Jun 2025 08:50:21 +0200
From: Markus Wichmann <nullplan@....net>
To: musl@...ts.openwall.com
Subject: Re: On current (and future) use of LCTRANS

On Sun, Jun 15, 2025 at 12:38:46AM +0200, Pablo Correa Gomez wrote:
> Unfortunately, there are quite some many functions that currently
> ignore locales when being passed to them:
>

I am wondering why you care so much about the _l variants of these
functions. IMO, the only sensible implementation of these is:

1. if the base function is locale-independent, ignore the locale
   argument.
2. otherwise, wrap the base function in uselocale() calls.

The main functionality should stay in the base function. And that second
one then also shows how to avoid the need for any more _l functions in
future.

> * is{w}alnum_l
> * is{w}alpha_l
> * is{w}blank_l
> * is{w}cntrl_l
> * is{w}digit_l
> * is{w}graph_l
> * is{w}lower_l
> * is{w}print_l
> * is{w}punct_l
> * is{w}space_l
> * is{w}upper_l
> * is{w}xdigit_l
> * iswctype_l
> * strfmon_l
> * to{w}lower_l
> * to{w}upper_l
> * towctrans_l
>

Isn't that by design? Except for strfmon, these are all the ctype and
wctype functions. Those should remain locale independent, shouldn't
they? musl only supports two runtime codesets, namely ASCII and UTF-8,
and only one wide character codeset, namely Unicode. (And ASCII only on
sufferance, because POSIX decided to require MB_CUR_MAX in the POSIX
locale to be 1).

So the ctype functions can only ever be their ASCII versions, because
the ASCII bytes are the only possible single-byte characters (in ASCII
mode, the high bytes aren't characters, and in UTF-8 mode, the high
bytes aren't complete characters), and the wctype functions can only
ever be their Unicode versions, and no locale can ever change that.
Support for other character sets is relegated to the iconv() API, which
can convert anything else into the only sensible choice, UTF-8.
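
As a minimal sketch of the two cases listed near the top of this message (ignore the locale argument, or wrap the base function in uselocale() calls), the following C shows what such _l variants could look like. The names sketch_isdigit_l and sketch_strcoll_l are hypothetical and chosen only for illustration. This is not musl's actual code, and it assumes a POSIX.1-2008 system that provides locale_t and uselocale().

```c
#define _POSIX_C_SOURCE 200809L
#include <ctype.h>
#include <locale.h>
#include <string.h>

/* Case 1: the base function is locale-independent, so the locale
 * argument is accepted and ignored. (Hypothetical name.) */
int sketch_isdigit_l(int c, locale_t loc)
{
    (void)loc;
    return isdigit(c);
}

/* Case 2: the base function depends on the locale, so install the
 * requested locale for the calling thread around the call.
 * (Hypothetical name.) */
int sketch_strcoll_l(const char *a, const char *b, locale_t loc)
{
    locale_t old = uselocale(loc);  /* returns the previous thread locale */
    int r = strcoll(a, b);
    uselocale(old);                 /* put the previous locale back */
    return r;
}
```

With a locale object obtained from newlocale(LC_ALL_MASK, "C", (locale_t)0), sketch_strcoll_l() compares as strcoll() would under that locale, and the calling thread's locale is the same afterwards as it was before the call.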

Also note that
- is{,w}digit can never be changed as per POSIX
- is{,w}xdigit can only be changed for the alphabetic characters, but
  then the number parsing functions have to be changed to be consistent.

> We might be able to go without using locales in some of them (like
> isdigit), but we certainly cannot with others that currently use ASCII
> codes where letters in other alphabets don't fit.
>
> In addition, we have some functions related to collation where this is
> also ignored:
>

Well, this is because for now, musl has only been using codepoint
collation in all locales. But that is what Rich is currently working on,
isn't it? I don't know how locale-independent the result will be.

> * {wcs,wcsn,str,strn}casecmp_l
> * {wcs,str}coll_l
> * {wcs,str}xfrm_l
> * wctrans_l
> * wctype_l
>

strcasecmp() is a bad API that is underspecified for multibyte codesets.
Its specification allows, but does not require, an implementation to
perform the case mapping in wide-character space, and to deal with
encoding errors by returning an error, without specifying how to do so.
For this reason, applications cannot expect any sensible behaviour out
of it as soon as the input strays from ASCII, and so any locale
dependency is just moot.

Also, the casecmp and wctrans functions deal with case mapping, which is
different from collation, isn't it? And very definitely locale
independent, except possibly for the multibyte↔widechar conversion.

> In addition to this, we have the RADIXCHAR, which we hard-code in many
> places while doing transformations. Finding the exact places where it
> has to be implemented might be more tricky, but a non-exhaustive list:
>
> * vstrfmon_l (internal used by strfmon family)
> * fmt_ft (internal used by printf family)
> * dec_float,hex_float (internal used by floatscan family)
>

Rich has in the past expressed his wish to at most allow one other radix
character, namely the comma, and I support that.
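
To make the "at most one other radix character" restriction concrete, here is a rough sketch of a lookup that accepts only '.' or ',' and falls back to '.' for anything else the locale might report. The helper names sketch_filter_radix and sketch_radix_char are hypothetical; nl_langinfo(RADIXCHAR) is the standard POSIX query. This only illustrates the idea and is not how musl does or will implement it.

```c
#define _XOPEN_SOURCE 700
#include <langinfo.h>

/* Hypothetical helper: reduce a locale-supplied radix string to one of
 * the two permitted characters. Only a string that is exactly ","
 * selects the comma; everything else (".", "", "\r", multi-character
 * strings, NULL) yields '.'. */
char sketch_filter_radix(const char *r)
{
    if (r && r[0] == ',' && r[1] == '\0')
        return ',';
    return '.';
}

/* Hypothetical helper: radix character for the current locale,
 * restricted to '.' or ','. */
char sketch_radix_char(void)
{
    return sketch_filter_radix(nl_langinfo(RADIXCHAR));
}
```

A float formatter or parser written this way only ever has to handle two single-byte cases, and a locale that sets RADIXCHAR to something like '\r' is simply treated as '.'.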

Having RADIXCHAR be freely definable makes things way overcomplicated
for no real gain. Imagine someone setting RADIXCHAR to '\r'.

And I think those are all the places where float can be converted to a
string or vice versa.

Ciao,
Markus