Date: Thu, 10 Nov 2022 09:44:48 -0500
From: Rich Felker <dalias@...c.org>
To: Florian Weimer <fweimer@...hat.com>
Cc: musl@...ts.openwall.com
Subject: Re: Choice of wchar_t mapping for non-ASCII bytes in the POSIX locale

On Thu, Nov 10, 2022 at 09:07:53AM +0100, Florian Weimer wrote:
> It has come to my attention that musl uses the range 0xDF80…0xDFFF to
> cover the entire byte range:
>
> /* Arbitrary encoding for representing code units instead of characters. */
> #define CODEUNIT(c) (0xdfff & (signed char)(c))
> #define IS_CODEUNIT(c) ((unsigned)(c)-0xdf80 < 0x80)
>
> There is a very similar surrogate character mapping for undecodable
> UTF-8 bytes, suggested here:
>
> <https://web.archive.org/web/20090830064219/http://mail.nl.linux.org/linux-utf8/2000-07/msg00040.html>
>
> It uses 0xDC80…0xDCFF. This has been picked up by various
> implementations, including Python.
>
> Is there a reason why musl picked a different surrogate mapping here?
> Isn't it similar enough to the UTF-8 hack that it makes sense to pick
> the same range?

I'll have to look back through archives to see what the motivations for the particular range were -- I seem to recall there being some. But I think the more important thing here is the *lack* of any motivation to align with anything else. The values here are explicitly *not* intended for use in any sort of information interchange. They're invalid codes that are not Unicode scalar values, and the only reason they exist at all is to make application-internal (or even implementation-internal, in the case of regex/glob/etc.) round-tripping work in the byte-based C locale while avoiding assigning character properties to the bytes or inadvertently handling them in a way that might facilitate pretending they're just latin1.
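
As a rough illustration of the round-tripping described above, here is a minimal sketch in C. The two macros are the ones quoted in this thread; the helper names byte_to_wide, wide_to_byte and check_round_trip are hypothetical, introduced only for this example, and are not musl's actual functions. The sketch also assumes that converting a value above 127 to signed char yields a negative number, which is implementation-defined but holds on common two's-complement targets.

```c
#include <assert.h>

/* The two macros quoted earlier in this thread, unchanged. */
#define CODEUNIT(c) (0xdfff & (signed char)(c))
#define IS_CODEUNIT(c) ((unsigned)(c)-0xdf80 < 0x80)

/* Hypothetical helper: map one byte to its internal wide value.
 * ASCII bytes map to themselves; bytes 0x80..0xFF land in 0xDF80..0xDFFF
 * (assuming the signed char conversion wraps to a negative value). */
unsigned byte_to_wide(unsigned char b)
{
    return (unsigned)CODEUNIT(b);
}

/* Hypothetical helper: recover the original byte, or return -1 if the
 * value is neither ASCII nor one of the code-unit placeholders. */
int wide_to_byte(unsigned wc)
{
    if (wc < 0x80)
        return (int)wc;
    if (IS_CODEUNIT(wc))
        return (int)(unsigned char)wc;
    return -1;
}

/* Check that every possible byte value survives the round trip. */
void check_round_trip(void)
{
    for (int b = 0; b < 256; b++)
        assert(wide_to_byte(byte_to_wide((unsigned char)b)) == b);
}
```

Under those assumptions, the byte 0xE9 becomes 0xDFE9 and converts back to 0xE9, while the wide value 0xE9 itself (the Latin-1 code point for that byte) is rejected, so the bytes cannot be mistaken for latin1 characters. A value from the 0xDC80…0xDCFF range is rejected as well, since the two ranges do not overlap.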

Aside from that, I'm not sure how closely "invalid non-UTF-8 bytes that appeared in a stream expected to be UTF-8" and "bytes of what's expected to be valid UTF-8 being treated bytewise for processing by user request" are related. The proposal you linked is a decent implementation-internal choice for handling data in a binary-clean manner where that's needed (e.g. a text editor operating on files containing a mix of text and binary data or a mix of text encodings), but I think (or at least hope?) that in the years since it was written, there's come to be a consensus that it is *not* a good idea to do this as a "decoding" operation (where the data is saved out as invalid UTF-16 or -32 and used in interchange, as opposed to just internally) because it breaks lots of the good properties of UTF-8.

Rich