musl - Re: Choice of wchar_t mapping for non-ASCII bytes in the POSIX locale

Message-ID: <87v8nlo7pc.fsf@oldenburg.str.redhat.com>
Date: Fri, 11 Nov 2022 16:02:23 +0100
From: Florian Weimer <fweimer@...hat.com>
To: Rich Felker <dalias@...c.org>
Cc: musl@...ts.openwall.com
Subject: Re: Choice of wchar_t mapping for non-ASCII bytes in the
 POSIX locale

* Rich Felker:

> On Thu, Nov 10, 2022 at 09:07:53AM +0100, Florian Weimer wrote:
>> It has come to my attention that musl uses the range 0xDF80…0xDFFF to
>> cover the entire byte range:
>> 
>> /* Arbitrary encoding for representing code units instead of characters. */
>> #define CODEUNIT(c) (0xdfff & (signed char)(c))
>> #define IS_CODEUNIT(c) ((unsigned)(c)-0xdf80 < 0x80)
>> 
>> There is a very similar surrogate character mapping for undecodable
>> UTF-8 bytes, suggested here:
>> 
>>   <https://web.archive.org/web/20090830064219/http://mail.nl.linux.org/linux-utf8/2000-07/msg00040.html>
>> 
>> It uses 0xDC80…0xDCFF.  This has been picked up by various
>> implementations, including Python.
>> 
>> Is there a reason why musl picked a different surrogate mapping here?
>> Isn't it similar enough to the UTF-8 hack that it makes sense to pick
>> the same range?
>
> I'll have to look back through archives to see what the motivations
> for the particular range were -- I seem to recall there being some.
> But I think the more important thing here is the *lack* of any
> motivation to align with anything else. The values here are explicitly
> *not* intended for use in any sort of information interchange. They're
> invalid codes that are not Unicode scalar values, and the only reason
> they exist at all is to make application-internal (or even
> implementation-internal, in the case of regex/glob/etc.)
> round-tripping work in the byte-based C locale while avoiding
> assigning character properties to the bytes or inadvertently handling
> them in a way that might facilitate pretending they're just latin1.
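
As a rough illustration of the round-tripping described above, here is a
minimal sketch built on the two macros quoted earlier.  The helpers
byte_to_wc, wc_to_byte and check_all_bytes are hypothetical names made
up for this example and are not musl interfaces; the sketch also assumes
that converting an out-of-range value to signed char wraps, as it does
with GCC and Clang.

```c
#include <assert.h>
#include <wchar.h>

/* The two macros as quoted in the message above. */
#define CODEUNIT(c) (0xdfff & (signed char)(c))
#define IS_CODEUNIT(c) ((unsigned)(c)-0xdf80 < 0x80)

/* Hypothetical helper: map one byte to a wchar_t value.
 * ASCII bytes map to themselves; bytes 0x80..0xFF map to 0xDF80..0xDFFF.
 * Assumes the (signed char) conversion wraps (implementation-defined). */
static wchar_t byte_to_wc(unsigned char b)
{
	return (wchar_t)CODEUNIT(b);
}

/* Hypothetical helper: map a wchar_t back to a byte.
 * Returns -1 for anything that is neither ASCII nor a code unit value. */
static int wc_to_byte(wchar_t wc)
{
	if ((unsigned)wc < 0x80) return (int)wc;
	if (IS_CODEUNIT(wc)) return (int)(unsigned char)wc;
	return -1;
}

/* Every byte value should survive the byte -> wchar_t -> byte trip. */
static void check_all_bytes(void)
{
	for (int b = 0; b < 256; b++) {
		wchar_t wc = byte_to_wc((unsigned char)b);
		if (b < 0x80) assert(wc == b);
		else assert(wc == 0xdf00 + b);
		assert(wc_to_byte(wc) == b);
	}
}
```

Under these assumptions, a Latin-1 code point such as 0x00E9 is rejected
by wc_to_byte, which matches the stated goal of not treating the high
bytes as Latin-1 characters.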

For glibc, we are doing this because POSIX now requires the C (POSIX)
locale to use a single-byte character set with wchar_t mappings for all
bytes.  Previously, I had hoped to transition to UTF-8 by default
(possibly with a surrogate-escape encoding like Python's).

I guess as an alternative, we could just use the Latin-1 mapping.  Why
hasn't musl done this?  Because it would promote the idea that the world
is Latin-1?
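
To make the three candidates concrete, here is a small sketch comparing
a Latin-1 identity mapping, the 0xDC80..0xDCFF surrogate-escape range,
and the 0xDF80..0xDFFF range from the macros quoted above.  All function
names are hypothetical and exist only for this comparison; none is a
glibc or musl interface.

```c
#include <assert.h>
#include <stdint.h>

/* All names below are hypothetical; none is a real libc interface. */

/* Latin-1 identity mapping: byte 0xE9 becomes U+00E9, a real letter. */
static uint32_t map_latin1(unsigned char b)
{
	return b;
}

/* Surrogate-escape style: high bytes land in 0xDC80..0xDCFF. */
static uint32_t map_surrogate_escape(unsigned char b)
{
	return b < 0x80 ? b : 0xdc00u + b;
}

/* Range used by the CODEUNIT macro: high bytes land in 0xDF80..0xDFFF. */
static uint32_t map_codeunit(unsigned char b)
{
	return b < 0x80 ? b : 0xdf00u + b;
}

/* Surrogates (0xD800..0xDFFF) are not Unicode scalar values. */
static int is_scalar_value(uint32_t c)
{
	return c <= 0x10ffffu && c - 0xd800u >= 0x800u;
}

/* Only the Latin-1 mapping yields real characters for the high bytes;
 * the two surrogate ranges differ by a constant offset of 0x300. */
static void check_mappings(void)
{
	for (unsigned b = 0x80; b < 0x100; b++) {
		unsigned char c = (unsigned char)b;
		assert(is_scalar_value(map_latin1(c)));
		assert(!is_scalar_value(map_surrogate_escape(c)));
		assert(!is_scalar_value(map_codeunit(c)));
		assert(map_codeunit(c) - map_surrogate_escape(c) == 0x300u);
	}
}
```

The practical difference is that the Latin-1 values are ordinary
characters that other code may classify and transform as such, while
both surrogate ranges produce values that are not Unicode scalar values.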

> Aside from that, I'm not sure how closely "invalid non-UTF-8 bytes
> that appeared in a stream expected to be UTF-8" and "bytes of what's
> expected to be valid UTF-8 being treated bytewise for processing by
> user request" are related.

I think those two are fairly similar?  But “fake single-byte character
set due to POSIX mandate” is different?

Thanks,
Florian
