Date: Sat, 9 May 2015 16:05:36 -0400
From: Rich Felker <dalias@...c.org>
To: 罗勇刚(Yonggang Luo)  <luoyonggang@...il.com>
Cc: John Sully <john@...uare.ca>, Karsten Blees <blees@...n.de>,
	musl@...ts.openwall.com, dplakosh@...t.org,
	austin-group-l@...ngroup.org, hsutter@...rosoft.com,
	Clang Dev <cfe-dev@...uiuc.edu>,
	James McNellis <james@...esmcnellis.com>
Subject: Re: Re: [cfe-dev] Is that getting wchar_t to be 32bit on
 win32 a good idea for compatible with Unix world by implement posix layer on
 win32 API?

On Sat, May 09, 2015 at 07:19:14PM +0800, 罗勇刚(Yonggang Luo)  wrote:
> 2015-05-09 18:36 GMT+08:00 Szabolcs Nagy <nsz@...t70.net>:
> > * John Sully <john@...uare.ca> [2015-05-09 00:55:12 -0700]:
> >> In my opinion you almost never want 32-bit wide characters once you learn
> >> of their limitations.  Most people assume that if they use them they can
> >> return to the one character -> one glyph idiom like ASCII.  But Unicode is
> >
> > wchar_t must be at least 21 bits on a system that supports unicode
> > in any locale: it has to be able to represent all code points of the
> > supported character set.
> >
> > in practice this means that the only conforming definition to iso c
> > (and thus posix, c++ and other standards based on c) is a 32bit wchar_t
> > (the signedness can be chosen freely).
> >
> > so the definition is not based on what "you almost never want" or what
> > "most people assume".
> >
> > if the goal is to provide a posix implementation then 16bit wchar_t
> > is not an option (assuming the system wants to be able to communicate
> > with the external world that uses unicode text).
> wchar_t is not the only way to communicate with the external world, and
> it's also not suited to communicating with the external world,

Of course it's not. UTF-8 is. But per both ISO C and POSIX, any
character the locale supports has a representation as wchar_t. If
wchar_t is only 16-bit, then you fundamentally can't support all of
Unicode in the locale's encoding. mbrtowc has to fail with EILSEQ for
4-byte characters, regex functions cannot process 4-byte characters,
etc. Such a system is conforming to the requirements for C and
POSIX but does not support Unicode (in full) at the locale level.
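
To make the arithmetic concrete, here is a minimal sketch (not taken from
any libc; decode_utf8_4 and fits_in_value_bits are hypothetical names)
that decodes a 4-byte UTF-8 sequence by hand and checks whether the
result fits in a wchar_t with a given number of value bits. Every valid
4-byte sequence decodes to a code point in U+10000..U+10FFFF, so with
only 16 bits there is no value mbrtowc could store:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode exactly one 4-byte UTF-8 sequence starting at s.
 * Returns the code point, or UINT32_MAX if the bytes are not a
 * well-formed 4-byte sequence (bad lead/continuation byte, overlong
 * encoding, or a value above U+10FFFF). */
uint32_t decode_utf8_4(const unsigned char *s)
{
    assert(s != NULL);

    /* Lead byte of a 4-byte sequence has the form 11110xxx. */
    if ((s[0] & 0xF8) != 0xF0)
        return UINT32_MAX;

    /* The three continuation bytes have the form 10xxxxxx. */
    for (int i = 1; i < 4; i++)
        if ((s[i] & 0xC0) != 0x80)
            return UINT32_MAX;

    uint32_t cp = ((uint32_t)(s[0] & 0x07) << 18)
                | ((uint32_t)(s[1] & 0x3F) << 12)
                | ((uint32_t)(s[2] & 0x3F) << 6)
                |  (uint32_t)(s[3] & 0x3F);

    /* Reject overlong encodings and values outside Unicode. */
    if (cp < 0x10000 || cp > 0x10FFFF)
        return UINT32_MAX;

    return cp;
}

/* Would cp fit in an unsigned type with `bits` value bits?
 * This stands in for "a wchar_t of that width". */
int fits_in_value_bits(uint32_t cp, unsigned bits)
{
    return bits >= 32 || cp < ((uint32_t)1 << bits);
}
```

With these definitions, the bytes F0 9F 98 80 decode to 0x1F600, which
fits in 21 value bits but not in 16.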

> The C11 standard never restricts wchar_t's width, and for POSIX, most
> APIs are implemented in terms of UTF-8. Indeed, Windows needs the POSIX
> layer mainly because of those APIs that use UTF-8, not the wchar_t APIs.
> For communication purposes, making wchar_t 32-bit on Win32 is
> not a good idea.
> 
> And portable text-processing apps or libs (including Win32 ones)
> should never depend on wchar_t being 32 bits wide.

If __STDC_ISO_10646__ is defined, wchar_t must have at least 21 value
bits. Applications which are portable only to systems where this macro
is defined, or which have some fallback (like dropping multilingual
text support) for systems where it's not defined, CAN make such
assumptions.
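
As a rough sketch of how an application might gate that assumption
(HAVE_UNICODE_WCHAR, value_bits_needed and wchar_holds_all_of_unicode
are hypothetical names of mine, and static_assert assumes a C11
compiler), something like the following would do. It also shows where
the figure 21 comes from: that is the number of bits needed to hold
0x10FFFF, the largest code point.

```c
#include <assert.h>
#include <stdint.h>
#include <wchar.h>

#ifdef __STDC_ISO_10646__
/* wchar_t values are ISO 10646 code points; the code below may rely
 * on one wchar_t holding any code point. */
#define HAVE_UNICODE_WCHAR 1
static_assert(WCHAR_MAX >= 0x10FFFF,
              "__STDC_ISO_10646__ is defined but wchar_t is too narrow");
#else
/* No such promise: take the fallback path (for example, drop
 * multilingual text support or refuse to build). */
#define HAVE_UNICODE_WCHAR 0
#endif

/* Number of value bits needed to represent every value up to max. */
unsigned value_bits_needed(uint32_t max)
{
    unsigned bits = 0;
    while (max) {
        bits++;
        max >>= 1;
    }
    return bits;
}

/* Run-time form of the same check, usable to select the fallback. */
int wchar_holds_all_of_unicode(void)
{
    return HAVE_UNICODE_WCHAR && (uintmax_t)WCHAR_MAX >= 0x10FFFF;
}
```

On a system that does not define the macro, the static_assert is
skipped and wchar_holds_all_of_unicode() returns 0, so the fallback
path is taken instead of silently mishandling non-BMP text.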

> And C11/C++11 already provide uchar.h with the cross-platform types
> char16_t and char32_t, so there is no reason to make wchar_t 32-bit
> on Win32 in order to support POSIX on Win32.

If wchar_t is 16-bit, you can't represent non-BMP characters in
char32_t because they can't be part of the locale's character set. All
char32_t buys you then is 16 wasted zero bits.

> We intend to create a usable POSIX layer on Win32, not a theoretical
> POSIX layer that would be useless. On Win32, we should take the
> de facto conventions of the platform into account.

Uselessness is a big assumption you're making that's not supported by
data. If you actually provide a working POSIX layer, you'll have
pretty much any application that's currently working on Linux, BSDs,
etc. (with actual portable code, not system-specific #ifdefs) working
on Windows with few or no changes. If you do that with 32-bit wchar_t,
they'll support Unicode fully. If you do it with 16-bit wchar_t, then
the ones that are using the locale system for character handling will
have to be refitted with extra layers to support more than the BMP,
and those patches probably (hopefully) won't be accepted upstream.

The only applications that would benefit from having 16-bit wchar_t
are existing Windows applications that are not going to have much use
for a POSIX layer anyway, and they can be fixed very easily with
search-and-replace (no new code layers).

Rich
