Date: Sat, 6 Jun 2015 17:40:07 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: [PATCH] Byte-based C locale, draft 1

Attached is the first draft of a proposed byte-based C locale. The patch is about 400 lines, but most of it is context, because it's basically a lot of tiny changes spread out over lots of files.

With this patch applied, the plain "C" (or "POSIX") locale converts each of the bytes in the range 0x80 to 0xff to a wchar_t value in the range 0xdf80 to 0xdfff, the end of the low surrogates range. I had originally intended to use the range 0x7fffff80 to 0x7fffffff, but C11 introduced mbrtoc16 and c16rtomb, imposing a requirement that all characters in the locale's character set have a mapping into char16_t. The easiest way to achieve this was to use a range of wchar_t values that are already representable in char16_t but that don't overlap with valid characters, and in turn the only way to do that was with unpaired surrogates.

The intent is that the wchar_t values produced for high bytes in the C locale should not be treated as having any meaning as characters. They are simply UTF-8 code units (in the language of Unicode) and, to reflect this, nl_langinfo(CODESET) returns "UTF-8-CODE-UNITS". Their usefulness is that programs that process data through wchar_t can safely round-trip arbitrary bytes, and, more importantly, regex and fnmatch patterns can be used to match byte patterns instead of character patterns.

The logic for how locales are chosen is unchanged, so roughly speaking, the C locale only gets used in applications which either don't use the locale API at all (in which case they should not expect functions that depend on LC_CTYPE to work as expected) or which end up requesting it explicitly or via environment defaults. In particular, the C locale is active only when one of the following applies:

- The application has not called setlocale at all for LC_CTYPE.

- The application has explicitly requested "C" or "POSIX" for LC_CTYPE in a call to setlocale or newlocale followed by uselocale.

- The application has requested the default locale for LC_CTYPE, via an empty string as the locale name or a base of (locale_t)0 and a mask omitting LC_CTYPE_MASK, in a call to setlocale or newlocale followed by uselocale, and the contents of the standard locale-related environment variables yield "C" or "POSIX" for LC_CTYPE.

Before applying this I should probably overhaul fnmatch.c again. I believe it has some hard-coded UTF-8 processing code in it for the useless "check the tail before middle" step that I've been wanting to eliminate. Alternatively I could just apply a quick fix to make it work right without any invasive changes.

Other than possible weird cases with fnmatch (which are largely harmless but might inhibit matching high bytes in non-UTF-8 mode), this code should be ready for testing. I'd appreciate some feedback from anyone interested in the feature.

Rich

View attachment "bytelocale_v1.diff" of type "text/plain" (10222 bytes)
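
For readers who want to see the byte mapping from the message spelled out, here is a minimal sketch in C. It is not taken from the attached patch: the names byte_to_wc, wc_to_byte and check_roundtrip are hypothetical, and the sketch assumes that bytes below 0x80 map to themselves. The only facts used from the message are that bytes 0x80 to 0xff correspond to wchar_t values 0xdf80 to 0xdfff.

```c
#include <assert.h>
#include <wchar.h>

/* Hypothetical helper (not from the patch): map one byte to the wchar_t
 * value described in the message. Assumption: bytes below 0x80 map to
 * themselves; bytes 0x80..0xff map to 0xdf80..0xdfff. */
wchar_t byte_to_wc(unsigned char b)
{
    if (b < 0x80)
        return (wchar_t)b;
    return (wchar_t)(0xdf00 + b);
}

/* Hypothetical inverse: return the byte for a wchar_t value, or -1 if
 * the value is outside the two ranges used by this sketch. */
int wc_to_byte(wchar_t wc)
{
    unsigned long u = (unsigned long)wc;

    if (u < 0x80)
        return (int)u;
    if (u >= 0xdf80 && u <= 0xdfff)
        return (int)(u - 0xdf00);
    return -1;
}

/* Every byte value should survive a trip through wchar_t and back. */
void check_roundtrip(void)
{
    for (int b = 0; b < 256; b++)
        assert(wc_to_byte(byte_to_wc((unsigned char)b)) == b);
}
```

Because every value produced here is at most 0xdfff, it fits in a single 16-bit unit, which is the property the message relies on for char16_t. Values such as 0xdf7f or 0x20ac fall outside both ranges and are rejected by wc_to_byte in this sketch.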