Message-ID: <cdaaef9e-1b54-4d83-8400-9bfa6465936b@gmail.com>
Date: Thu, 29 May 2025 15:47:06 +0200
From: Gabriel Ravier <gabravier@...il.com>
To: musl@...ts.openwall.com, Markus Wichmann <nullplan@....net>
Cc: Nuno Cruces <ncruces@...il.com>, Rich Felker <dalias@...c.org>
Subject: Re: strcasestr("", "") returns NULL

On 5/17/25 8:49 AM, Markus Wichmann wrote:
> On Fri, May 16, 2025 at 01:37:41PM -0400, Rich Felker wrote:
>> On Fri, May 16, 2025 at 05:59:22PM +0100, Nuno Cruces wrote:
>>> I wasn't suggesting they'd be used, just that they kinda demonstrate the
>>> problem is not absolutely intractable.
>>> In any case, if you're interested, glibc seems to use two-way for
>>> strcasestr, though I'm not sure if it supports every locale.
>>> I haven't looked into the details.
>> Interesting -- I wonder how they do it.
> tolower(). That's how they do it. So yeah, it is just for single-byte
> locales.
>
> Ciao,
> Markus

Glibc's code for strcasestr has a comment that simply states:
 > This function gives unspecified results in multibyte locales.

So they don't even try to make it work in those conditions, yeah. I do
wonder whether it's any worse than e.g. strcasecmp, though: in glibc,
both strcasecmp and strcasestr simply apply tolower to each input
character to canonicalize it, and otherwise do nothing special to make
the case-insensitive comparison work.
The same is true in musl (for strcasecmp, and in practice for
strcasestr, since it calls strncasecmp, which does the actual
comparison the same way strcasecmp does). I would be far more hesitant
to assume that simple canonicalization with tolower works for
strcasestr than for strcasecmp (especially given glibc's comment about
strcasestr giving unspecified results in multibyte locales), but I do
wonder how strcasecmp itself works with multibyte locales in the first
place... Does it?
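
To make the question concrete, here is a minimal sketch of the kind of
byte-wise tolower comparison described above. It is not glibc's or
musl's actual source; the name sketch_strcasecmp is made up for
illustration.

```c
#include <ctype.h>

/* Hypothetical helper, not taken from glibc or musl: compares two
 * strings byte by byte after folding each byte with tolower(). */
int sketch_strcasecmp(const char *l, const char *r)
{
    const unsigned char *a = (const unsigned char *)l;
    const unsigned char *b = (const unsigned char *)r;

    /* Advance while both strings have bytes left and the folded
     * bytes are equal. */
    while (*a && *b && tolower(*a) == tolower(*b)) {
        a++;
        b++;
    }

    /* Zero if both ended together, otherwise the folded difference. */
    return tolower(*a) - tolower(*b);
}
```

Since tolower only ever sees one byte at a time, a multibyte character
is never folded as a unit. In the default "C" locale, for example, the
UTF-8 encodings of "É" ("\xc3\x89") and "é" ("\xc3\xa9") compare as
different with this sketch.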

Then again, POSIX does state (about strcasecmp and strncasecmp):
 > When the LC_CTYPE category of the locale being used is from the POSIX
 > locale, these functions shall behave as if the strings had been
 > converted to lowercase and then a byte comparison performed. Otherwise,
 > the results are unspecified.

So if strcasestr were added to POSIX at some point, I would expect POSIX
to say the same thing about it (anything else would make basically no
sense unless strcasecmp and strncasecmp were also changed to work with
multibyte locales). Perhaps that implies these functions should be
implemented as if only single-byte locales matter?
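
If only single-byte locales matter, a search built directly on
strncasecmp could look roughly like the sketch below. This is an
illustration under that assumption, not musl's or glibc's code; the
name sketch_strcasestr is hypothetical, and the empty-needle check
assumes strcasestr should mirror strstr, where an empty needle matches
at the start of the haystack.

```c
#include <stddef.h>
#include <string.h>
#include <strings.h>

/* Hypothetical sketch, not taken from musl or glibc: naive
 * case-insensitive substring search built on strncasecmp(). */
char *sketch_strcasestr(const char *h, const char *n)
{
    size_t len = strlen(n);

    /* Assumption: like strstr, an empty needle matches at the start
     * of any haystack, including the empty one. */
    if (len == 0)
        return (char *)h;

    /* Try every starting position in the haystack. */
    for (; *h; h++) {
        if (strncasecmp(h, n, len) == 0)
            return (char *)h;
    }
    return NULL;
}
```

Without the explicit empty-needle check, the loop body never runs for
an empty haystack, so the ("", "") case would fall through to the
final return NULL. Note that this naive loop is quadratic in the worst
case, unlike a two-way search.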

