Date: Thu, 21 May 2015 23:04:47 -0500
From: Josiah Worcester <>
Subject: Re: Revisiting byte-based C locale

On Thu, May 21, 2015 at 9:22 PM, Rich Felker <> wrote:
> The last time the byte-based C locale topic was visited ("Possible
> bytelocale patch"),
> it was a rather ugly patch introducing lots of code duplication. Now,
> I believe the callers of multibyte/wide char functions which need to
> always work in UTF-8 mode (iconv) or need to match a previously-saved
> mode (stdio wide functions, which save the encoding in the FILE when
> it becomes wide-oriented) can simply swap __pthread_self()->locale
> back and forth. There is no longer a possibility that the thread
> pointer may be uninitialized, nor a heavy synchronization cost of
> switching thread-local locales from the atomics in uselocale -- commit
> 68630b55c0c7219fe9df70dc28ffbf9efc8021d8 removed all that.
>
> Thus, I think we're at a point where we can evaluate the choice to
> support or not to support a byte-based C locale on the basis of things
> like standards conformance and impact on users and on software
> compatibility without having to weigh implementation costs (which
> would have contributed to "impact on users").
>
> Since last year, the issue of byte-based C locale has come up a few
> more times as a stumbling point for users on the IRC channel and/or
> mailing list (I forget which and haven't gone back to look it up yet).
> In particular, broken configure tests passing binary data to grep
> failed, and I believe one or more language interpreters loading source
> files in the C locale errored out due to a Latin-1 encoded "©"
> character in source comments. Personally I'm in favor of getting the
> broken stuff fixed, but I can see both sides.
>
> There are also minor conformance reasons to consider the byte-based C
> locale even without accepting the resolution to Austin Group issue 663
> (which is supposedly imposing the requirement, someday). In
> particular, the C standard seems to allow the current behavior of
> musl, where the C locale has extra characters for which isw*() return
> true, as long as the non-wide is*() functions don't have such extra
> characters. C doesn't even define abstract character classes that
> these functions report, just loose requirements on their behavior. But
> POSIX specifies LC_CTYPE in terms of character classes which have
> members, and does not leave room for extra characters in the C locale
> as far as I can tell. This could affect real-world usage cases where
> an application intentionally running in the C locale expects the
> regex/fnmatch bracket [[:alpha:]] not to match anything but ASCII
> letters. As mentioned several times in the past, this non-conformance
> could be addressed by changes in the isw*() functions (making them
> locale-aware) rather than by adding the byte-based C locale, but if
> there are other motivations to support the byte-based C locale, it
> may make sense to solve both issues with one change.
>
> Any new opinions on the topic? Or interest in re-emphasizing a
> previously stated opinion? :)
>
> Rich
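
For readers unfamiliar with the "swap the locale back and forth" idea
mentioned in the quoted message, here is a rough sketch of the same
pattern using only the public POSIX uselocale() interface. It is not
musl's internal code (which the message says would touch
__pthread_self()->locale directly); the helper name decode_in_locale is
hypothetical and chosen only for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <locale.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical helper (not part of musl or POSIX): decode one multibyte
 * character from s under the locale `loc`, then restore whatever locale
 * the calling thread was using before the call.
 */
size_t decode_in_locale(locale_t loc, wchar_t *out, const char *s, size_t n)
{
    mbstate_t st;
    locale_t saved;
    size_t r;

    assert(out != NULL && s != NULL);
    memset(&st, 0, sizeof st);      /* start in the initial shift state */

    saved = uselocale(loc);         /* swap in the required locale */
    r = mbrtowc(out, s, n, &st);    /* decode under that locale */
    uselocale(saved);               /* swap the caller's locale back */

    return r;
}
```

A caller would create the locale object once with newlocale(), pass it
to the helper as often as needed, and release it with freelocale(). The
thread's own locale is left unchanged on return.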

Given the POSIX rules on LC_CTYPE character classes affecting
[[:alpha:]], it seems to me now that the clear intent (if not
statement) is in fact for a byte-based C locale. Though maybe
unfortunate, it does seem as though that is in fact the most
conformant way of doing it, and conforming looks to have little cost.
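
To make the [[:alpha:]] concern concrete, here is a minimal sketch of
the kind of check an application running in the C locale might perform
with the POSIX regex API. The helper name all_alpha is hypothetical.
What it reports for input containing bytes outside ASCII depends on how
the C library defines the C locale, which is exactly the point under
discussion.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: returns 1 if the whole string consists of
 * characters in the [[:alpha:]] class of the current locale, 0 if it
 * does not, and -1 if the pattern fails to compile.
 */
int all_alpha(const char *s)
{
    regex_t re;
    int rc;

    if (regcomp(&re, "^[[:alpha:]]+$", REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    rc = regexec(&re, s, 0, NULL, 0);   /* 0 means the string matched */
    regfree(&re);

    return rc == 0;
}
```

A program that never calls setlocale() runs in the C locale, so
all_alpha("abc") is expected to match and all_alpha("abc1") is not.
Feeding it a string such as "caf\xC3\xA9" shows whether the
implementation's C locale admits non-ASCII letters.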
