Date: Thu, 10 Nov 2022 09:07:53 +0100
From: Florian Weimer <>
Subject: Choice of wchar_t mapping for non-ASCII bytes in the POSIX locale

It has come to my attention that musl uses the range 0xDF80…0xDFFF to
represent the non-ASCII bytes 0x80…0xFF in the POSIX locale:

/* Arbitrary encoding for representing code units instead of characters. */
#define CODEUNIT(c) (0xdfff & (signed char)(c))
#define IS_CODEUNIT(c) ((unsigned)(c)-0xdf80 < 0x80)

There is a very similar surrogate character mapping for undecodable
UTF-8 bytes, suggested here:


It uses 0xDC80…0xDCFF.  This has been picked up by various
implementations, including Python.
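
To make the difference concrete, here is a small sketch that puts the two
mappings side by side.  The two macros are copied from the musl excerpt
above; dc_codeunit() and musl_codeunit() are hypothetical helpers written
only for this comparison and are not part of musl or any other library.
The sketch assumes the usual implementation-defined behaviour (as with GCC
and Clang) where converting an out-of-range value to signed char wraps
around.

```c
#include <stdio.h>

/* Copied from the musl excerpt quoted above. */
#define CODEUNIT(c) (0xdfff & (signed char)(c))
#define IS_CODEUNIT(c) ((unsigned)(c)-0xdf80 < 0x80)

/* Hypothetical helper: musl-style mapping of one byte.
 * ASCII bytes map to themselves, 0x80..0xFF map to 0xDF80..0xDFFF. */
static unsigned musl_codeunit(unsigned char b)
{
    return (unsigned)CODEUNIT(b);
}

/* Hypothetical helper: the 0xDC80..0xDCFF mapping mentioned above.
 * ASCII bytes map to themselves, 0x80..0xFF map to 0xDC80..0xDCFF. */
static unsigned dc_codeunit(unsigned char b)
{
    return b < 0x80 ? b : (0xdc00u | b);
}

/* Print both mappings for a few sample bytes. */
static void print_mappings(void)
{
    static const unsigned char samples[] = { 0x41, 0x80, 0xc3, 0xff };
    for (size_t i = 0; i < sizeof samples; i++) {
        unsigned char b = samples[i];
        printf("byte 0x%02X: musl 0x%04X, DC-range 0x%04X\n",
               (unsigned)b, musl_codeunit(b), dc_codeunit(b));
    }
}
```

Calling print_mappings() from a main() should show that, for every byte in
0x80…0xFF, the two results differ by the constant offset 0x300, while ASCII
bytes are unchanged under both mappings.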

Is there a reason why musl picked a different surrogate mapping here?
Isn't it similar enough to the UTF-8 hack that it makes sense to pick
the same range?

