Date: Thu, 10 Nov 2022 09:07:53 +0100
From: Florian Weimer <fweimer@...hat.com>
To: musl@...ts.openwall.com
Subject: Choice of wchar_t mapping for non-ASCII bytes in the POSIX locale

It has come to my attention that musl uses the range 0xDF80…0xDFFF to
represent the non-ASCII bytes 0x80…0xFF in the POSIX locale:

/* Arbitrary encoding for representing code units instead of characters. */
#define CODEUNIT(c) (0xdfff & (signed char)(c))
#define IS_CODEUNIT(c) ((unsigned)(c)-0xdf80 < 0x80)
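
To make the effect of these two macros concrete, here is a minimal
sketch that walks all 256 byte values.  The two macros are copied from
the snippet above; codeunit_to_byte() and check_codeunit_mapping() are
hypothetical helpers of mine, not musl functions, and the sketch assumes
the usual conversion where (signed char)0x80 yields -128.

```c
#include <assert.h>

/* Macros as quoted above from musl. */
#define CODEUNIT(c) (0xdfff & (signed char)(c))
#define IS_CODEUNIT(c) ((unsigned)(c)-0xdf80 < 0x80)

/* Hypothetical helper, not part of musl: the low 8 bits of a code-unit
 * value are the original byte, so masking recovers it. */
static unsigned char codeunit_to_byte(unsigned wc)
{
    return (unsigned char)(wc & 0xff);
}

/* Walk all 256 byte values and check the mapping.
 * Assumes (signed char)0x80 converts to -128, as on common platforms. */
static void check_codeunit_mapping(void)
{
    for (int b = 0; b < 256; b++) {
        unsigned wc = CODEUNIT(b);
        if (b < 0x80) {
            /* ASCII bytes map to themselves and are not code units. */
            assert(wc == (unsigned)b);
            assert(!IS_CODEUNIT(wc));
        } else {
            /* Bytes 0x80..0xFF land in 0xDF80..0xDFFF. */
            assert(wc == 0xdf00u + b);
            assert(IS_CODEUNIT(wc));
            assert(codeunit_to_byte(wc) == b);
        }
    }
}
```

So ASCII bytes pass through unchanged, and only the upper half of the
byte range ends up in 0xDF80…0xDFFF.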

There is a very similar surrogate character mapping for undecodable
UTF-8 bytes, suggested here:

  <https://web.archive.org/web/20090830064219/http://mail.nl.linux.org/linux-utf8/2000-07/msg00040.html>

It uses 0xDC80…0xDCFF.  This has been picked up by various
implementations, including Python (as its "surrogateescape" error
handler).
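
For comparison, here is a rough sketch of the 0xDC80…0xDCFF scheme,
where an undecodable byte b becomes 0xDC00 + b.  The function names are
hypothetical and are not taken from any of the implementations
mentioned; this is only meant to show how close the two mappings are.

```c
#include <assert.h>

/* Hypothetical helper: map a byte to its escaped value.
 * ASCII stays as-is; bytes 0x80..0xFF become 0xDC80..0xDCFF. */
static unsigned escape_byte(unsigned char b)
{
    return b < 0x80 ? b : 0xDC00u + b;
}

/* Hypothetical helper: is this value one of the escaped bytes? */
static int is_escaped_byte(unsigned wc)
{
    return wc >= 0xDC80u && wc <= 0xDCFFu;
}

/* Hypothetical helper: recover the original byte from an escaped value. */
static unsigned char unescape_byte(unsigned wc)
{
    return (unsigned char)(wc - 0xDC00u);
}

/* Check the round trip, and that musl's range is this one shifted by
 * the constant 0x300 (0xDF80 - 0xDC80). */
static void check_escape_mapping(void)
{
    for (unsigned b = 0x80; b < 0x100; b++) {
        unsigned wc = escape_byte((unsigned char)b);
        assert(is_escaped_byte(wc));
        assert(unescape_byte(wc) == b);
        assert(wc + 0x300u == 0xDF00u + b);
    }
}
```

Both ranges sit inside the low-surrogate block 0xDC00…0xDFFF, and for
the non-ASCII bytes the two mappings differ only by the constant offset
0x300.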

Is there a reason why musl picked a different surrogate mapping here?
Isn't it similar enough to the UTF-8 hack that it makes sense to pick
the same range?

Thanks,
Florian
