Date: Thu, 10 Nov 2022 09:07:53 +0100
From: Florian Weimer <fweimer@...hat.com>
To: musl@...ts.openwall.com
Subject: Choice of wchar_t mapping for non-ASCII bytes in the POSIX locale

It has come to my attention that musl uses the range 0xDF80…0xDFFF to
cover the entire byte range:

    /* Arbitrary encoding for representing code units instead of characters. */
    #define CODEUNIT(c) (0xdfff & (signed char)(c))
    #define IS_CODEUNIT(c) ((unsigned)(c)-0xdf80 < 0x80)

There is a very similar surrogate character mapping for undecodable
UTF-8 bytes, suggested here:

  <https://web.archive.org/web/20090830064219/http://mail.nl.linux.org/linux-utf8/2000-07/msg00040.html>

It uses 0xDC80…0xDCFF. This has been picked up by various
implementations, including Python.

Is there a reason why musl picked a different surrogate mapping here?
Isn't it similar enough to the UTF-8 hack that it makes sense to pick
the same range?

Thanks,
Florian
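
For readers comparing the two ranges, here is a minimal sketch of how the quoted macros behave next to the 0xDC80…0xDCFF convention. The two macros are copied from the message; `dc_escape` and `codeunit_to_byte` are hypothetical helpers written only for this illustration and are not part of musl or any other library. The sketch also assumes the usual behavior where converting a value of 0x80 or above to `signed char` yields a negative number.

```c
/* Macros as quoted in the message (taken from musl). */
#define CODEUNIT(c) (0xdfff & (signed char)(c))
#define IS_CODEUNIT(c) ((unsigned)(c)-0xdf80 < 0x80)

/*
 * Hypothetical helper, NOT part of musl: map a high byte (0x80..0xFF)
 * into the 0xDC80..0xDCFF range used by the UTF-8 escaping convention
 * mentioned in the message.
 */
static unsigned dc_escape(unsigned char b)
{
    return 0xdc00u | b;
}

/*
 * Hypothetical helper, NOT part of musl: recover the original byte from
 * a value in either range. Both mappings keep the byte in the low 8 bits.
 */
static unsigned char codeunit_to_byte(unsigned wc)
{
    return (unsigned char)(wc & 0xffu);
}
```

Under these assumptions, `CODEUNIT(0x80)` gives 0xDF80 while `dc_escape(0x80)` gives 0xDC80, so the two schemes differ by a constant offset of 0x300 across the high bytes. ASCII bytes pass through `CODEUNIT` unchanged, and `IS_CODEUNIT` rejects values in the 0xDC80…0xDCFF range.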