musl - Choice of wchar_t mapping for non-ASCII bytes in the POSIX locale

Message-ID: <875yfn5j1i.fsf@oldenburg.str.redhat.com>
Date: Thu, 10 Nov 2022 09:07:53 +0100
From: Florian Weimer <fweimer@...hat.com>
To: musl@...ts.openwall.com
Subject: Choice of wchar_t mapping for non-ASCII bytes in the POSIX locale

It has come to my attention that musl uses the range 0xDF80…0xDFFF to
represent the non-ASCII bytes 0x80…0xFF in the POSIX locale:

/* Arbitrary encoding for representing code units instead of characters. */
#define CODEUNIT(c) (0xdfff & (signed char)(c))
#define IS_CODEUNIT(c) ((unsigned)(c)-0xdf80 < 0x80)

There is a very similar surrogate character mapping for undecodable
UTF-8 bytes, suggested here:

  <https://web.archive.org/web/20090830064219/http://mail.nl.linux.org/linux-utf8/2000-07/msg00040.html>

It uses 0xDC80…0xDCFF.  This has been picked up by various
implementations, including Python.
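
To make the two ranges concrete, here is a minimal sketch that puts the
musl macros quoted above next to the 0xDC80…0xDCFF convention.  The
helper names byte_to_wc_musl, byte_to_wc_dc and check_mappings are
hypothetical and exist only for this illustration; they are not part of
musl or of any other implementation.  The sketch also assumes the usual
two's-complement conversion to signed char, which the CODEUNIT macro
relies on.

```c
#include <assert.h>
#include <wchar.h>

/* Copied from the musl snippet quoted above. */
#define CODEUNIT(c) (0xdfff & (signed char)(c))
#define IS_CODEUNIT(c) ((unsigned)(c)-0xdf80 < 0x80)

/* Hypothetical helper: musl-style mapping of a single byte.
 * ASCII bytes map to themselves, 0x80..0xFF map to 0xDF80..0xDFFF. */
static wchar_t byte_to_wc_musl(unsigned char b)
{
	return CODEUNIT(b);
}

/* Hypothetical helper: the 0xDC80..0xDCFF convention for bytes >= 0x80.
 * ASCII bytes are passed through unchanged (an assumption made here so
 * that the two helpers are directly comparable). */
static wchar_t byte_to_wc_dc(unsigned char b)
{
	return b < 0x80 ? b : 0xdc00 + b;
}

/* Checks the endpoints of both ranges and the offset between them. */
static void check_mappings(void)
{
	assert(byte_to_wc_musl('A') == 'A');
	assert(byte_to_wc_musl(0x80) == 0xdf80);
	assert(byte_to_wc_musl(0xff) == 0xdfff);
	assert(IS_CODEUNIT(byte_to_wc_musl(0xe9)));
	assert(!IS_CODEUNIT(byte_to_wc_musl('A')));

	assert(byte_to_wc_dc(0x80) == 0xdc80);
	assert(byte_to_wc_dc(0xff) == 0xdcff);

	/* For every non-ASCII byte the two mappings differ by 0x300. */
	for (int b = 0x80; b <= 0xff; b++)
		assert(byte_to_wc_musl(b) - byte_to_wc_dc(b) == 0x300);
}
```

Under these assumptions the two schemes differ only by a constant
offset of 0x300 within the low-surrogate block.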

Is there a reason why musl picked a different surrogate mapping here?
Isn't it similar enough to the UTF-8 hack that it makes sense to pick
the same range?

Thanks,
Florian
