Date: Sun, 17 Feb 2013 14:04:52 -0500 From: Rich Felker <dalias@...ifal.cx> To: musl@...ts.openwall.com Subject: Re: strcasestr.c On Thu, Feb 14, 2013 at 10:23:49AM -0500, Rich Felker wrote: > On Thu, Feb 14, 2013 at 09:59:56AM -0500, Todd C. Miller wrote: > > When investigating using the musl strstr.c ofr OpenBSD I noticed > > that musl only has a stub for strcasestr() that calls strstr(). I > > was curious whether the twoway algorithm could be adapted to do a > > case-insensitive search. It turned out to be pretty trivial to > > just add calls to tolower() in the right places, making sure to > > avoid sign extension. > > > > The changes are mostly mechanical. You might wish to inline > > _strcasechr() though the compiler will probably do that for you. > > Unfortunately, as far as I can tell making this correct is nontrivial; > your version only works for ascii, not the rest of ucs. I don't see > any easy way to adapt 2way to the case where matching classes contain > characters of different lengths... To elaborate, since strcasestr is not a standard function with a rigorous specification, I find it difficult to determine what the "correct" behavior should be. It's only documented to "ignore the case". Does this mean it should operate like a literal regex seach with REG_ICASE? Or should it operate as if by using single-byte tolower/toupper functions, in which case your code would be correct? Actually we already have this problem for strcasecmp, even though it's specified by POSIX. POSIX just says: When the LC_CTYPE category of the current locale is from the POSIX locale, strcasecmp() and strncasecmp() shall behave as if the strings had been converted to lowercase and then a byte comparison performed. Otherwise, the results are unspecified. It seems POSIX clearly allows implementations to do whatever they like in non-POSIX locale, but since musl is providing just a POSIX locale for LC_CTYLE, we're under certain obligations. Note that XBD 7.2 POSIX Locale allows us, as we're doing, to have character properties and case mappings outside of those specified in POSIX as long as they're for characters outside the portable character set: The tables in Locale Definition describe the characteristics and behavior of the POSIX locale for data consisting entirely of characters from the portable character set and the control character set. For other characters, the behavior is unspecified. For C-language programs, the POSIX locale shall be the default locale when the setlocale() function is not called. Thus, it seems like since we have case mappings for all of Unicode, strcasecmp needs to honor those, and thus the current implementation is probably non-conforming. Since strcasestr is nonstandard and not clearly specified, I don't see any _obligation_ for it to do the same, but perhaps it should. Thus I think my previous reply to your patch was premature; your code isn't necessarily wrong, but it's a matter that needs further consideration to decide what course of action is appropriate. Comments from anybody else? Rich
Powered by blists - more mailing lists
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.