musl - [PATCH] Byte-based C locale, draft 1

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <20150606214007.GA17398@brightrain.aerifal.cx>
Date: Sat, 6 Jun 2015 17:40:07 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: [PATCH] Byte-based C locale, draft 1

Attached is the first draft of a proposed byte-based C locale. The
patch is about 400 lines but most of it is context, because it's
basically a lot of tiny changes spread out over lots of files.

With this patch applied, the plain "C" (or "POSIX") locale has
converts each of the bytes in the range 0x80 to 0xff to a wchar_t
value in the range 0xdf80 to 0xdfff, the end of the low surrogates
range. I had originally intended to use the range 0x7fffff80 to
0x7fffffff, but C11 introduced mbrtoc16 and c16rtomb, imposing a
requirement that all characters in the locale's character set have a
mapping into char16_t. The easiest way to achieve this was to use a
range of wchar_t values that are already representable in char16_t but
that don't overlap with valid characters, and in turn the only way to
do that was with unpaired surrogates.

The intent is that the wchar_t values produced for high byte in the C
locale should not be treated as having any meaning as characters. They
are simply UTF-8 code units (in the language of Unicode) and, to
reflect this, nl_langinfo(CODESET) returns "UTF-8-CODE-UNITS". Their
usefulness is that programs that process data through wchar_t can
safely round-trip arbitrary bytes, and, more importantly, regex and
fnmatch patterns can be used to match byte patterns instead of
character patterns.

The logic for how locales are chosen is unchanged, so roughly
speaking, the C locale only gets used in applications which either
don't use the locale API at all (in which case they should not expect
functions that depend on LC_CTYPE to work as expected) or which end up
requesting it explicitly or via environment defaults. In particular,
the C locale is active only when one of the following applies:

- The application has not called setlocale at all for LC_CTYPE.

- The application has explicitly requested "C" or "POSIX" for LC_CTYPE
  in a call to setlocale or newlocale followed by uselocale.

- The application has requested the default locale for LC_CTYPE, via
  an empty string as the locale name or a base of (locale_t)0 and a
  mask omitting LC_CTYPE_MASK, in a call to setlocale or newlocale
  followed by uselocale, and the contents of the standard
  locale-related environment variables yield "C" or "POSIX" for
  LC_CTYPE.

Before applying this I should probably overhaul fnmatch.c again. I
believe it has some hard-coded UTF-8 processing code in it for the
useless "check the tail before middle" step that I've been wanting to
eliminate. Alternatively I could just apply a quick fix to make it
work right without any invasive changes.

Other than possible weird cases with fnmatch (which are largely
harmless but might inhibit matching high bytes in non-UTF-8 mode),
this code should be ready for testing. I'd appreciate some feedback
from anyone interested in the feature.

Rich

View attachment "bytelocale_v1.diff" of type "text/plain" (10222 bytes)
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.