Message-ID: <20250529140803.GZ1827@brightrain.aerifal.cx>
Date: Thu, 29 May 2025 10:08:04 -0400
From: Rich Felker <dalias@...c.org>
To: Gabriel Ravier <gabravier@...il.com>
Cc: musl@...ts.openwall.com, Markus Wichmann <nullplan@....net>,
	Nuno Cruces <ncruces@...il.com>
Subject: Re: strcasestr("", "") returns NULL

On Thu, May 29, 2025 at 03:47:06PM +0200, Gabriel Ravier wrote:
> On 5/17/25 8:49 AM, Markus Wichmann wrote:
> > On Fri, May 16, 2025 at 01:37:41PM -0400, Rich Felker wrote:
> > > On Fri, May 16, 2025 at 05:59:22PM +0100, Nuno Cruces wrote:
> > > > I wasn't suggesting they'd be used, just that they kinda demonstrate the
> > > > problem is not absolutely intractable.
> > > > In any case, if you're interested, glibc seems to use two-way for
> > > > strcasestr, though I'm not sure if it supports every locale.
> > > > I haven't looked into the details.
> > > Interesting -- I wonder how they do it.
> > tolower(). That's how they do it. So yeah, it is just for single-byte
> > locales.
> > 
> > Ciao,
> > Markus
> 
> Glibc's code for strcasestr has a comment that simply states:
> > This function gives unspecified results in multibyte locales.

Yes, I'm well aware that the underlying strcasecmp is unspecified in
other locales. My mistake in opening that can of worms was what
brought us the new POSIX requirement for the byte-based C locale...

> So they don't even try to make it work in those conditions, yeah. I do
> wonder if it's any worse than e.g. strcasecmp though, given that in glibc,
> both strcasecmp and strcasestr simply directly use tolower on input
> characters to canonicalize them before usage in the function but otherwise
> don't do anything special to make the case-insensitive comparison work.
> This is also the case in musl (for strcasecmp, and in practice for
> strcasestr, since it calls strncasecmp which behaves the same as strcasecmp
> w.r.t. how it does the actual comparison), so while I would be far more
> hesitant to directly think simple canonicalization with tolower would work
> with strcasestr than with strcasecmp (especially given glibc's comment in
> strcasestr on giving unspecified results with multibyte locales), I do
> wonder how strcasecmp itself works with multibyte locales in the first
> place... Does it?

It doesn't.
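
To illustrate why: folding one byte at a time through tolower cannot pair
up characters that are encoded as several bytes. What follows is only a
minimal sketch, assuming UTF-8 encoded input and the default "C" locale at
program start; fold_bytes_equal is a hypothetical helper written for this
example, not code taken from musl or glibc.

```c
#include <ctype.h>

/* Hypothetical helper: returns 1 if the two strings are equal after
 * folding every byte with tolower(), 0 otherwise. This mirrors the
 * "canonicalize with tolower, then compare" approach discussed above. */
int fold_bytes_equal(const char *a, const char *b)
{
	const unsigned char *l = (const unsigned char *)a;
	const unsigned char *r = (const unsigned char *)b;

	for (; *l && *r; l++, r++)
		if (tolower(*l) != tolower(*r))
			return 0;

	/* Equal only if both strings ended at the same position. */
	return *l == *r;
}
```

With this helper, "HELLO" and "hello" compare equal, but the UTF-8
encodings of U+00C9 ("\xc3\x89") and U+00E9 ("\xc3\xa9") do not: tolower
only ever sees the individual bytes 0x89 and 0xA9, never the character
they belong to.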

> Then again, POSIX does state (about strcasecmp and strncasecmp):
> > When the LC_CTYPE category of the locale being used is from the POSIX
> > locale, these functions shall behave as if the strings had been
> > converted to lowercase and then a byte comparison performed.
> > Otherwise, the results are unspecified.
> 
> So if strcasestr was added to POSIX at some point, I would expect POSIX to
> state the same thing about it (anything else would make basically no sense
> unless they also change strcasecmp and strncasecmp to work with multibyte
> locales) - perhaps that implies these functions should be implemented as if
> only single-byte locales matter?

POSIX does not specify them for single-byte locales in general, *only*
for the C/POSIX locale. Basically, these functions just should not be
used. They are not compatible with multilingual text support, and
they're not even safe to use with ASCII-only input unless your
application rejects the locale system entirely, since just having
LC_CTYPE set to the user's locale may cause them not to work *even on
ASCII bytes*.
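
For code that only needs ASCII case-insensitivity (protocol keywords,
header names and the like), one locale-independent alternative is to fold
'A'..'Z' by hand instead of going through tolower. The usual example of
the ASCII problem is a Turkish locale, where the lowercase of 'I' need not
be 'i'. The following is a rough sketch under the assumption of an
ASCII-compatible execution character set; ascii_lower and ascii_strcasecmp
are hypothetical names for this example, not musl or POSIX interfaces.

```c
/* Hypothetical helper: fold only ASCII 'A'..'Z'. It never consults
 * LC_CTYPE, so the result is the same whatever setlocale() was given. */
static unsigned char ascii_lower(unsigned char c)
{
	return (c >= 'A' && c <= 'Z') ? (unsigned char)(c - 'A' + 'a') : c;
}

/* Hypothetical name. Same return convention as strcasecmp: negative,
 * zero or positive depending on how a orders relative to b. */
int ascii_strcasecmp(const char *a, const char *b)
{
	const unsigned char *l = (const unsigned char *)a;
	const unsigned char *r = (const unsigned char *)b;

	/* Advance while the folded bytes match and a has not ended. */
	while (*l && ascii_lower(*l) == ascii_lower(*r)) {
		l++;
		r++;
	}
	return ascii_lower(*l) - ascii_lower(*r);
}
```

Bytes outside 'A'..'Z' are compared as-is, so non-ASCII text is matched
exactly rather than case-insensitively; that is a deliberate limitation
of the sketch, not an attempt at multilingual case folding.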

Rich
