Date: Fri, 27 Jun 2014 15:04:12 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Locale framework RFC

Background:

One of the agenda items for this release cycle is locale framework.
This is needed both to support a byte-based POSIX locale (the
unfortunate fallout of Austin Group issue #663) and for legitimate
locale purposes like collation, localized time formats, etc.

Note that, at present, musl's "everything is UTF-8" C locale is
already non-conforming to the requirements of ISO C, because it places
extra characters in the C locale's character classes like alpha, etc.
beyond what the standard allows. This could be fixed by making the
definitions of the character classes locale-dependent, but if we just
accept the (bad, wrong, backwards, etc.) new POSIX requirements for
the C/POSIX locale, we get a fix for free: it doesn't matter if
iswalpha(0xc0) returns true in the C locale if the wchar_t value 0xc0
can never be generated in the C locale.

My proposed solution is to provide a backwards C locale where bytes
0x80 through 0xff are interpreted as abstract bytes that decode to
wchar_t values which are either invalid Unicode or PUA codepoints. The
latter is probably preferable since generating invalid codepoints may,
strictly speaking, make it wrong to define __STDC_ISO_10646__.
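
To make the proposal concrete, here is a hypothetical sketch of what decoding a single byte in the proposed byte-based C locale could look like. The PUA base U+E000 is one possible choice, not a decided value, and the function name is illustrative only:

```c
#include <wchar.h>

/* Hypothetical decoding for the byte-based C locale: bytes 0x80-0xff
 * map to abstract wchar_t values that real UTF-8 input can never
 * produce.  U+E000 (start of the PUA) is one candidate base. */
static wchar_t c_locale_decode_byte(unsigned char b)
{
	return b < 0x80 ? b : 0xE000 + (b - 0x80); /* 0x80..0xff -> U+E000..U+E07F */
}
```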

How does this affect real programs? Not much at all. A program that
hasn't called setlocale() can't expect to be able to use the multibyte
interfaces reasonably anyway, so it doesn't matter that they default
to byte mode when the program starts up. And if the program does call
setlocale correctly (with an argument of "", which means to use the
'default locale', defined by POSIX with the LC_* env vars), it will
get a proper UTF-8-based locale anyway unless the user has explicitly
overridden that by setting LC_CTYPE=C or LC_ALL=C. So really all
that's seriously affected are scripts using LC_CTYPE=C or LC_ALL=C to
do byte-based processing using the standard utilities, and the
behavior of these is "improved".


Implementation:

Three new fields in the libc structure:

1. locale_t global_locale;

This is the locale presently selected by setlocale() and which affects
all threads which have not called uselocale() or which called
uselocale with LC_GLOBAL_LOCALE as the argument.

2. int uselocale_cnt;

uselocale_cnt is the current number of threads with a thread-local
locale. It's incremented/decremented (atomically) by the uselocale
function when transitioning from LC_GLOBAL_LOCALE to a thread-local
locale or vice versa, respectively, and also decremented (atomically)
in pthread_exit if the exiting thread has a thread-local locale. The
purpose of having uselocale_cnt is that, whenever uselocale_cnt is
zero, libc.global_locale can be used directly with no TLS access to
determine if the current thread has a thread-local locale.
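
A mock-up of the counter maintenance uselocale would perform (the names here are simplified stand-ins for the real musl internals, not actual code; the query-only case is included for completeness):

```c
#include <stdatomic.h>
#include <stddef.h>

struct mock_locale { int utf8; };
typedef struct mock_locale *locale_t_;
#define LC_GLOBAL_LOCALE_ ((locale_t_)-1)

static struct {
	locale_t_ global_locale;
	atomic_int uselocale_cnt;
} libc_;

static struct { locale_t_ locale; } self;  /* stand-in for __pthread_self() */

static locale_t_ uselocale_(locale_t_ new_loc)
{
	locale_t_ old = self.locale ? self.locale : LC_GLOBAL_LOCALE_;
	if (!new_loc) return old;                       /* query only */
	if (new_loc == LC_GLOBAL_LOCALE_) new_loc = NULL;
	if (new_loc && !self.locale)
		atomic_fetch_add(&libc_.uselocale_cnt, 1);  /* global -> thread-local */
	else if (!new_loc && self.locale)
		atomic_fetch_sub(&libc_.uselocale_cnt, 1);  /* thread-local -> global */
	self.locale = new_loc;
	return old;
}
```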

3. int bytelocale_cnt_minus_1;

This is a second atomic counter which behaves similarly to
uselocale_cnt, except that it is only incremented/decremented when the
thread-local locale being activated/deactivated is non-UTF-8
(byte-based). The global locale set by setlocale is also tracked in
the count, and the result is offset by -1.

Initially at program startup (when setlocale has not been called), the
value of bytelocale_cnt_minus_1 is zero. Setting any locale but "C" or
"POSIX" for LC_CTYPE with setlocale will enable UTF-8 and thus
decrement the value to -1. Setting any thread-local locale to "C" or
"POSIX" for LC_CTYPE will increment the value to something
non-negative.
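
A toy model of those transitions (illustrative stand-ins, not musl code): the counter holds the number of byte-based locale bindings currently active, offset by -1, and the global locale always counts as one binding, so the program starts at 1 - 1 = 0:

```c
#include <stdatomic.h>

static atomic_int bytelocale_cnt_minus_1;  /* 0 at startup: global C locale */
static int global_is_utf8;                 /* default C locale is byte-based */

/* Stand-in for the LC_CTYPE part of setlocale() on the global locale. */
static void setlocale_ctype_(int utf8)
{
	if (utf8 != global_is_utf8)
		atomic_fetch_add(&bytelocale_cnt_minus_1, utf8 ? -1 : 1);
	global_is_utf8 = utf8;
}
```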

All functions which are optimized for the sane case of all data being
UTF-8 therefore have a trivial fast-path: if
libc.bytelocale_cnt_minus_1 is negative, they can immediately assume
UTF-8 with no further tests. Otherwise checking libc.uselocale_cnt is
necessary to determine whether to inspect libc.global_locale or
__pthread_self()->locale to determine whether to decode UTF-8 or treat
bytes as abstract bytes.

Per earlier testing I did when Austin Group issue #663 was being
discussed, a single access and conditional jump based on data in the
libc structure does not yield measurable performance cost in UTF-8
decoding. For encoding (wc[r]tomb) there may be a small performance
cost added on archs that need a GOT pointer for GOT-relative accesses
(vs direct PC-relative), since the current code has no GOT pointer.
Fortunately decoding, not encoding, is the performance-critical
operation.

Code which uses locale:

The basic idiom for getting the locale will be:

	locale_t loc = libc.uselocale_cnt && __pthread_self()->locale
		? __pthread_self()->locale
		: libc.global_locale;

And if all that's needed is a UTF-8 flag:

	int is_utf8 = libc.bytelocale_cnt_minus_1<0 || loc->utf8;

where "loc" is the result of the previous expression above. This test
looks fairly expensive, but the only cases with any cost are when
there's at least one thread with a non-UTF-8 locale. Even in the case
where uselocale is in heavy use, as long as it's not being used to
turn off UTF-8, there's no performance penalty.

Components affected:

1. Multibyte functions: must use the above tests to see whether to
process UTF-8 or behave as dumb byte functions. Note that the
restartable multibyte functions (those which use mbstate_t) can skip
the check when the state is not the initial state, since use of the
state after changing locale is UB.
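
A sketch of the fast-path test such a function could start with; the counter and the locale lookup here are stand-ins for the real libc-struct fields described above, and the byte-mode mapping is simplified:

```c
#include <stddef.h>
#include <wchar.h>

static int bytelocale_cnt_minus_1 = -1;  /* negative: everything is UTF-8 */
static int thread_in_byte_locale(void) { return 0; }  /* stand-in */

static size_t mbrtowc_(wchar_t *wc, const char *s, size_t n, mbstate_t *st)
{
	/* Restartable functions may skip the locale check when the state is
	 * not initial: using the state across a locale change is UB anyway. */
	if (bytelocale_cnt_minus_1 < 0
	    || (st && !mbsinit(st))
	    || !thread_in_byte_locale())
		return mbrtowc(wc, s, n, st);    /* UTF-8 decode path */
	/* byte mode: one byte, one abstract character */
	if (!n) return (size_t)-2;
	if (wc) *wc = (unsigned char)*s;         /* simplified mapping */
	return *s != 0;
}
```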

2. Character class functions: should not be affected at all, but we
need to make sure they're all returning false for the characters
decoded from high bytes in bytelocale mode.

3. Stdio wide mode: It's required to bind to the character encoding in
effect at the time the FILE goes into wide mode, rather than at the
time of the IO operation. So rather than using mbrtowc or wcrtomb, it
needs to store the state at the time of entering wide mode and use a
conversion that's conditional on this saved flag rather than on the
locale.
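
A minimal sketch of that binding; FILE_ is a mock with only the relevant fields, standing in for whatever flag the real FILE would gain:

```c
typedef struct { int mode; int wide_utf8; } FILE_;

static int locale_is_utf8 = 1;  /* stand-in for the locale test above */

/* Sketch of fwide(): entering wide mode snapshots the encoding, so a
 * later locale change cannot affect this stream's wide I/O. */
static int fwide_(FILE_ *f, int mode)
{
	if (mode > 0 && !f->mode) {
		f->mode = 1;
		f->wide_utf8 = locale_is_utf8;
	}
	return f->mode;
}
```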

4. Code which uses mbtowc and/or wctomb assuming they always process
UTF-8: Aside from the above-mentioned use in stdio, this is probably
just iconv. To fix this, I propose adding new functions which don't
check the locale but always process UTF-8. These could also be used
for stdio wide mode, and they could use a different API than the
standard functions in order to be more efficient (e.g. returning the
decoded character, or negative for errors, rather than storing the
result via a pointer argument).
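
A sketch of what such an internal decoder might look like (the name and exact API are illustrative, not a decided interface): always UTF-8, no locale check, and a return-value API giving the decoded codepoint, or a negative value on error, instead of storing through a pointer:

```c
#include <stddef.h>

/* Returns the decoded codepoint, -1 on invalid input, -2 on truncated
 * input; *len receives the number of bytes consumed on success.
 * Overlong-form and surrogate checks are omitted for brevity. */
static int utf8_decode_one(const unsigned char *s, size_t n, size_t *len)
{
	if (!n) return -2;
	unsigned c = s[0];
	if (c < 0x80) { *len = 1; return c; }          /* ASCII fast path */
	int need = c >= 0xf0 ? 3 : c >= 0xe0 ? 2 : c >= 0xc2 ? 1 : -1;
	if (need < 0) return -1;                       /* invalid lead byte */
	if (n < (size_t)need + 1) return -2;
	unsigned cp = c & (0x3f >> need);
	for (int i = 1; i <= need; i++) {
		if ((s[i] & 0xc0) != 0x80) return -1;  /* bad continuation */
		cp = cp << 6 | (s[i] & 0x3f);
	}
	*len = need + 1;
	return cp;
}
```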

5. MB_CUR_MAX macro: It needs to expand to a function call rather than
an integer constant expression, since it has to be 1 for the new POSIX
locale. The function can in turn use the is_utf8 pattern above to
determine the right return value.
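
For example, something along these lines (stand-in names, following the is_utf8 pattern):

```c
#include <stddef.h>

static int bytelocale_cnt_minus_1 = -1;  /* stand-in for the libc field */
static int current_locale_is_utf8(void) { return 1; }  /* stand-in */

static size_t mb_cur_max_impl(void)
{
	if (bytelocale_cnt_minus_1 < 0 || current_locale_is_utf8())
		return 4;  /* UTF-8: up to 4 bytes per character */
	return 1;          /* byte-based C/POSIX locale */
}

#define MB_CUR_MAX_ (mb_cur_max_impl())
```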

6. setlocale, uselocale, and related functions: These need to
implement the locale switching and above atomic counters logic.

7. pthread_exit: Needs to decrement relevant atomic counters.

8. nl_langinfo and nl_langinfo_l: At present, the only item they need
to support on a per-thread basis is CODESET. For the byte-based C
locale, this could be "8BIT", "BINARY", "ASCII+8BIT" or similar. Here
it needs to be decided whether nl_langinfo should be responsible for
determining the locale_t to pass to nl_langinfo_l, or whether
nl_langinfo_l should accept (locale_t)0 and do its own determination.
This issue will also affect other */*_l pairs that need non-trivial
implementations later.

9. iconv: In addition to the above issue in item 4, iconv should
support whatever value nl_langinfo(CODESET) returns for the C locale
as a from/to argument, even if it's largely useless.

Overall Impact:

Should be near-zero on programs that don't use locale-related
features: a few bytes in the global libc struct and a couple extra
lines in pthread_exit.
