musl - Re: Choice of wchar_t mapping for non-ASCII bytes in the POSIX locale

Message-ID: <20221111153812.GK29905@brightrain.aerifal.cx>
Date: Fri, 11 Nov 2022 10:38:12 -0500
From: Rich Felker <dalias@...c.org>
To: Florian Weimer <fweimer@...hat.com>
Cc: musl@...ts.openwall.com
Subject: Re: Choice of wchar_t mapping for non-ASCII bytes in the
 POSIX locale

On Fri, Nov 11, 2022 at 04:02:23PM +0100, Florian Weimer wrote:
> * Rich Felker:
> 
> > On Thu, Nov 10, 2022 at 09:07:53AM +0100, Florian Weimer wrote:
> >> It has come to my attention that musl uses the range 0xDF80…0xDFFF to
> >> cover the entire byte range:
> >> 
> >> /* Arbitrary encoding for representing code units instead of characters. */
> >> #define CODEUNIT(c) (0xdfff & (signed char)(c))
> >> #define IS_CODEUNIT(c) ((unsigned)(c)-0xdf80 < 0x80)
> >> 
> >> There is a very similar surrogate character mapping for undecodable
> >> UTF-8 bytes, suggested here:
> >> 
> >>   <https://web.archive.org/web/20090830064219/http://mail.nl.linux.org/linux-utf8/2000-07/msg00040.html>
> >> 
> >> It uses 0xDC80…0xDCFF.  This has been picked up by various
> >> implementations, including Python.
> >> 
> >> Is there a reason why musl picked a different surrogate mapping here?
> >> Isn't it similar enough to the UTF-8 hack that it makes sense to pick
> >> the same range?
> >
> > I'll have to look back through archives to see what the motivations
> > for the particular range were -- I seem to recall there being some.
> > But I think the more important thing here is the *lack* of any
> > motivation to align with anything else. The values here are explicitly
> > *not* intended for use in any sort of information interchange. They're
> > invalid codes that are not Unicode scalar values, and the only reason
> > they exist at all is to make application-internal (or even
> > implementation-internal, in the case of regex/glob/etc.)
> > round-tripping work in the byte-based C locale while avoiding
> > assigning character properties to the bytes or inadvertently handling
> > them in a way that might facilitate pretending they're just latin1.
> 
> For glibc, we are doing this because POSIX requires this for the C
> (POSIX) locale.  It's now required to use a single-byte character set
> with wchar_t mappings for all bytes.  Previously, I had hoped to
> transition to UTF-8 by default (possibly with a surrogate-escape
> encoding like Python's).

Yes, that's entirely my fault and I'm so sorry. I reported a bug where
an interface's spec was ambiguous because they hadn't considered the
possibility that the C locale might be multibyte, and rather than fix
it, all the old-timers freaked out that something they were taking for
granted (that the C locale would be byte-based) wasn't actually
specified.

> I guess as an alternative, we could just use the Latin-1 mapping.  Why
> hasn't musl done this?  Because it would promote the idea that the world
> is Latin-1?

Exactly. musl has always been very intentional about not supporting
legacy m17n-incompatible encodings, and about character identity under
musl not being locale-specific. So, when we got stuck having to do a
byte-based C locale because of the above unfortunate outcome, what we
strove for was a way to express "these are code units of UTF-8 being
processed as individual bytes for a workflow where the user wants to
operate on bytes".
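
To make the mapping concrete, here is a minimal sketch built around the
two macros quoted earlier in the thread. The helper names byte_to_wc,
wc_to_byte and roundtrip_check are hypothetical and exist only for
illustration; they are not musl interfaces. The sketch also assumes the
usual wrapping behavior when an out-of-range value is converted to
signed char, as on gcc and clang.

```c
#include <assert.h>
#include <wchar.h>

/* Macros as quoted in the thread. */
#define CODEUNIT(c) (0xdfff & (signed char)(c))
#define IS_CODEUNIT(c) ((unsigned)(c)-0xdf80 < 0x80)

/* Hypothetical helper: map one byte to its wide value.
 * ASCII bytes map to themselves; 0x80..0xff land in 0xdf80..0xdfff. */
wchar_t byte_to_wc(unsigned char b)
{
    return (wchar_t)CODEUNIT(b);
}

/* Hypothetical helper: map a wide value back to a byte.
 * Returns -1 for anything that is neither ASCII nor a code unit value. */
int wc_to_byte(wchar_t wc)
{
    if ((unsigned)wc < 0x80)
        return (int)wc;
    if (IS_CODEUNIT(wc))
        return (unsigned char)wc;   /* low 8 bits are the original byte */
    return -1;
}

/* Every byte value survives the trip through the wide representation. */
void roundtrip_check(void)
{
    for (int b = 0; b < 256; b++)
        assert(wc_to_byte(byte_to_wc((unsigned char)b)) == b);
}
```

Note that a value like 0xe9 (which would be U+00E9 under a Latin-1
mapping) is rejected by wc_to_byte in this sketch: the high bytes never
acquire a character identity.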

> > Aside from that, I'm not sure how closely "invalid non-UTF-8 bytes
> > that appeared in a stream expected to be UTF-8" and "bytes of what's
> > expected to be valid UTF-8 being treated bytewise for processing by
> > user request" are related.
> 
> I think those two are fairly similar?  But “fake single-byte character
> set due to POSIX mandate” is different?

They admit the same mechanism, and yes, they at least have
"similarities", but the problems themselves are somewhat different, I
think. And the former has lots of weird, likely unwanted behaviors,
like decode(concat(a,b)) != concat(decode(a),decode(b)), that arise
from the mapping only being taken in the 'error path' rather than
applied to all data uniformly.
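
As a rough illustration of that inequality, the sketch below is a
deliberately simplified UTF-8 decoder that escapes each undecodable
byte b as 0xDC00 + b, in the spirit of the 0xDC80...0xDCFF scheme
mentioned above. The function names decode_escape and concat_demo are
hypothetical, and the decoder skips overlong and surrogate checks for
three- and four-byte sequences, so treat it as a demonstration rather
than a conforming implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Decode in[0..n) as UTF-8 into out; returns the number of values
 * written. Bytes that cannot be decoded are escaped as 0xDC00 + byte,
 * and only on that error path. out must have room for n values. */
size_t decode_escape(const unsigned char *in, size_t n, unsigned *out)
{
    size_t i = 0, k = 0;
    while (i < n) {
        unsigned char b = in[i];
        size_t need, j;
        unsigned cp;
        if (b < 0x80) { out[k++] = b; i++; continue; }
        else if (b >= 0xC2 && b <= 0xDF) { need = 1; cp = b & 0x1F; }
        else if (b >= 0xE0 && b <= 0xEF) { need = 2; cp = b & 0x0F; }
        else if (b >= 0xF0 && b <= 0xF4) { need = 3; cp = b & 0x07; }
        else { out[k++] = 0xDC00u + b; i++; continue; } /* stray byte */
        /* Consume continuation bytes while they are present and valid. */
        for (j = 1; j <= need && i + j < n && (in[i + j] & 0xC0) == 0x80; j++)
            cp = cp << 6 | (in[i + j] & 0x3F);
        if (j <= need) { out[k++] = 0xDC00u + b; i++; } /* truncated: escape lead */
        else { out[k++] = cp; i += need + 1; }
    }
    return k;
}

/* Splitting the two bytes of U+00E9 changes the decoded result. */
void concat_demo(void)
{
    const unsigned char a[] = { 0xC3 }, b[] = { 0xA9 };
    const unsigned char ab[] = { 0xC3, 0xA9 };
    unsigned oa[1], ob[1], oab[2];
    assert(decode_escape(a, 1, oa) == 1 && oa[0] == 0xDCC3);
    assert(decode_escape(b, 1, ob) == 1 && ob[0] == 0xDCA9);
    /* Decoding the concatenation yields one character, not two escapes. */
    assert(decode_escape(ab, 2, oab) == 1 && oab[0] == 0xE9);
}
```

A mapping applied uniformly to every byte does not have this problem,
since each byte is converted independently of its neighbors.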

Regardless of whether there's a technical reason DF80... is better
than DC80..., I think I'd generally be disinclined to change anything
now. Not because I want to preserve an existing mapping that nothing
should be relying on, but because the only practical motivation for a
change would be to align the mapping for interchange purposes -- which
means, even if we say "this is explicitly not for interchange
purposes", to anyone reading the change it clearly is for interchange
purposes because that's the only effect, and thereby, we might as well
be saying "go ahead and use this for interchange purposes!"

Rich