musl - State of LC_CTYPE conformance


Message-ID: <20140703061107.GA2716@brightrain.aerifal.cx>
Date: Thu, 3 Jul 2014 02:11:07 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: State of LC_CTYPE conformance

Aside from actually providing useful features later on, one of the
motivations for adding the locale framework is to address conformance
issues in musl's current LC_CTYPE behavior for the C locale. However,
I'm not so sure those exist. My analysis of the current situation
follows:

ISO C places requirements not on the character class sets for the
locale (it doesn't really have such a concept) but rather simply on
the behavior of the ctype.h and wctype.h functions in the C
locale:

- islower: exactly the 26 lowercase ASCII letters
- isupper: exactly the 26 uppercase ASCII letters
- isalpha: exactly the union of the above two sets
- isblank: exactly space and tab
- ispunct: exactly printable characters for which neither isspace nor
  isalnum is true.
- isspace: exactly the standard 6 space characters

However, the wide functions are much less restricted; aside from
iswblank, which returns true exactly for space and tab, none of the
wctype.h functions has C-locale-specific restrictions.

Thus, as far as I can tell, musl's current behavior is 100% conforming
to the requirements of ISO C. The ctype.h functions behave exactly as
they would for ASCII (since the high bytes are invalid as characters)
and the wctype.h functions are free to provide full Unicode support.

POSIX, on the other hand, has more restrictive locale requirements.
There is a well-defined set of characters for each class, and the
ctype.h and wctype.h functions for each class are specified in terms
of membership in that set. Thus, for example, POSIX forbids a Latin-1
C locale where the ctype.h functions only reflect ASCII (as required
by ISO C) but the wctype.h functions return true as appropriate for
characters in the range U+0080 to U+00FF.

As far as I can tell, however, POSIX does not forbid musl's C locale,
since the ctype.h functions cannot reflect the presence of multibyte
characters in the set being tested against, and thus remain consistent
with the wctype.h functions and with the ISO C requirement not to
return true for "extra" members.

Does this analysis sound correct? If so, I may hold off on actually
adding byte-based LC_CTYPE for the time being and focus on more
constructive use of the new locale framework.

Rich
