Message-ID: <cdaaef9e-1b54-4d83-8400-9bfa6465936b@gmail.com>
Date: Thu, 29 May 2025 15:47:06 +0200
From: Gabriel Ravier <gabravier@...il.com>
To: musl@...ts.openwall.com, Markus Wichmann <nullplan@....net>
Cc: Nuno Cruces <ncruces@...il.com>, Rich Felker <dalias@...c.org>
Subject: Re: strcasestr("", "") returns NULL

On 5/17/25 8:49 AM, Markus Wichmann wrote:
> On Fri, May 16, 2025 at 01:37:41PM -0400, Rich Felker wrote:
>> On Fri, May 16, 2025 at 05:59:22PM +0100, Nuno Cruces wrote:
>>> I wasn't suggesting they'd be used, just that they kinda demonstrate the
>>> problem is not absolutely intractable.
>>> In any case, if you're interested, glibc seems to use two-way for
>>> strcasestr, though I'm not sure if it supports every locale.
>>> I haven't looked into the details.
>> Interesting -- I wonder how they do it.
> tolower(). That's how they do it. So yeah, it is just for single-byte
> locales.
>
> Ciao,
> Markus

Glibc's code for strcasestr has a comment that simply states:

> This function gives unspecified results in multibyte locales.

So they don't even try to make it work in those conditions, yeah.

I do wonder if it's any worse than e.g. strcasecmp though, given that in glibc, both strcasecmp and strcasestr simply use tolower directly on input characters to canonicalize them, but otherwise don't do anything special to make the case-insensitive comparison work. This is also the case in musl (for strcasecmp, and in practice for strcasestr, since it calls strncasecmp, which does the actual comparison the same way strcasecmp does).

So while I would be far more hesitant to assume that simple canonicalization with tolower works for strcasestr than for strcasecmp (especially given glibc's comment in strcasestr about unspecified results in multibyte locales), I do wonder how strcasecmp itself works with multibyte locales in the first place... Does it?

Then again, POSIX does state (about strcasecmp and strncasecmp):

> When the LC_CTYPE category of the locale being used is from the POSIX locale, these functions shall behave as if the strings had been converted to lowercase and then a byte comparison performed. Otherwise, the results are unspecified.

So if strcasestr was added to POSIX at some point, I would expect POSIX to state the same thing about it (anything else would make basically no sense unless they also change strcasecmp and strncasecmp to work with multibyte locales) - perhaps that implies these functions should be implemented as if only single-byte locales matter?
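
To make the "convert to lowercase, then compare bytes" model concrete, here is a minimal sketch of a tolower-based comparison and a naive search built on top of it. It is not the musl or glibc source: the names sketch_strncasecmp and sketch_strcasestr are hypothetical, the search is a plain loop rather than two-way, and the explicit empty-needle check is just one assumed way of handling the case from the subject line.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: compare at most n bytes of a and b, folding every
 * byte with tolower() first. Bytes are read as unsigned char, since
 * tolower() is only defined for unsigned char values and EOF. */
static int sketch_strncasecmp(const char *a, const char *b, size_t n)
{
    const unsigned char *l = (const unsigned char *)a;
    const unsigned char *r = (const unsigned char *)b;

    for (; n > 0; l++, r++, n--) {
        int cl = tolower(*l);
        int cr = tolower(*r);
        if (cl != cr)
            return cl - cr;   /* first differing (folded) byte decides */
        if (cl == 0)
            return 0;         /* both strings ended at the same place */
    }
    return 0;                 /* first n bytes were equal after folding */
}

/* Hypothetical naive search: try every start position in the haystack.
 * An empty needle is treated as matching at the start of the haystack,
 * so the ("", "") case yields the haystack pointer instead of NULL. */
static char *sketch_strcasestr(const char *h, const char *n)
{
    assert(h != NULL && n != NULL);

    size_t len = strlen(n);
    if (len == 0)
        return (char *)h;

    for (; *h; h++) {
        if (sketch_strncasecmp(h, n, len) == 0)
            return (char *)h;
    }
    return NULL;
}
```

Under the default "C" locale, sketch_strcasestr("Hello World", "WORLD") points at the "World" part of the haystack, while the UTF-8 encodings of "É" (0xC3 0x89) and "é" (0xC3 0xA9) compare as different because tolower only ever sees one byte at a time. That is the single-byte limitation discussed above.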