musl - Re: strcasestr.c

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20130217190452.GH20323@brightrain.aerifal.cx>
Date: Sun, 17 Feb 2013 14:04:52 -0500
From: Rich Felker <dalias@...ifal.cx>
To: musl@...ts.openwall.com
Subject: Re: strcasestr.c

On Thu, Feb 14, 2013 at 10:23:49AM -0500, Rich Felker wrote:
> On Thu, Feb 14, 2013 at 09:59:56AM -0500, Todd C. Miller wrote:
> > When investigating using the musl strstr.c ofr OpenBSD I noticed
> > that musl only has a stub for strcasestr() that calls strstr().  I
> > was curious whether the twoway algorithm could be adapted to do a
> > case-insensitive search.  It turned out to be pretty trivial to
> > just add calls to tolower() in the right places, making sure to
> > avoid sign extension.
> > 
> > The changes are mostly mechanical.  You might wish to inline
> > _strcasechr() though the compiler will probably do that for you.
> 
> Unfortunately, as far as I can tell making this correct is nontrivial;
> your version only works for ascii, not the rest of ucs. I don't see
> any easy way to adapt 2way to the case where matching classes contain
> characters of different lengths...

To elaborate, since strcasestr is not a standard function with a
rigorous specification, I find it difficult to determine what the
"correct" behavior should be. It's only documented to "ignore the
case". Does this mean it should operate like a literal regex seach
with REG_ICASE? Or should it operate as if by using single-byte
tolower/toupper functions, in which case your code would be correct?

Actually we already have this problem for strcasecmp, even though it's
specified by POSIX. POSIX just says:

    When the LC_CTYPE category of the current locale is from the POSIX
    locale, strcasecmp() and strncasecmp() shall behave as if the
    strings had been converted to lowercase and then a byte comparison
    performed. Otherwise, the results are unspecified.

It seems POSIX clearly allows implementations to do whatever they like
in non-POSIX locale, but since musl is providing just a POSIX locale
for LC_CTYLE, we're under certain obligations. Note that XBD 7.2 POSIX
Locale allows us, as we're doing, to have character properties and
case mappings outside of those specified in POSIX as long as they're
for characters outside the portable character set:

    The tables in Locale Definition describe the characteristics and
    behavior of the POSIX locale for data consisting entirely of
    characters from the portable character set and the control
    character set. For other characters, the behavior is unspecified.
    For C-language programs, the POSIX locale shall be the default
    locale when the setlocale() function is not called.

Thus, it seems like since we have case mappings for all of Unicode,
strcasecmp needs to honor those, and thus the current implementation
is probably non-conforming.

Since strcasestr is nonstandard and not clearly specified, I don't see
any _obligation_ for it to do the same, but perhaps it should. Thus I
think my previous reply to your patch was premature; your code isn't
necessarily wrong, but it's a matter that needs further consideration
to decide what course of action is appropriate.

Comments from anybody else?

Rich
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.