musl - Re: [ Guidance ] Potential New Routines; Requesting Help

Message-ID: <20191230195744.GJ30412@brightrain.aerifal.cx>
Date: Mon, 30 Dec 2019 14:57:44 -0500
From: Rich Felker <dalias@...c.org>
To: JeanHeyd Meneide <phdofthehouse@...il.com>
Cc: musl@...ts.openwall.com
Subject: Re: [ Guidance ] Potential New Routines; Requesting Help

On Mon, Dec 30, 2019 at 01:39:10PM -0500, JeanHeyd Meneide wrote:
> On Mon, Dec 30, 2019 at 12:31 PM Rich Felker <dalias@...c.org> wrote:
> > > ...
> > This is interesting, but I'm trying to understand the motivation.
> >
> > If __STDC_ISO_10646__ is defined, wchar_t is UTF-32/UCS-4, and the
> > proposed functions are just the identity (for the c32 ones) and
> > UTF-16/32 conversion.
> >
> > If it's not defined, you have the same problem as the current mb/cNN
> > functions: there's no reason to believe arbitrary Unicode characters
> > can round-trip through wchar_t any better than they can through
> > multibyte characters. In fact on such implementations it's likely that
> > wchar_t meanings are locale-dependent and just a remapping of the
> > byte/multibyte characters.
> 
>      I'm sorry, I'll try to phrase it as best as I can.
> 
>      The issue I and others have with the lack of cNNtowc is that, if
> we are to write standards-compliant C, the only way to do that
> transformation from, for example, char16_t data to wchar_t portably
> is:
> 
>      c16rtomb -> multibyte data -> mbrtowc
> 
>      The problem with such a conversion sequence is that there are
> many legacy encodings, and this causes bugs on many users' machines.
> Text representable in both char16_t and wchar_t is lost in the middle
> because the intermediate multibyte encoding cannot represent it, so we
> lose data going between wchar_t and char16_t. This has been
> frustrating for a number of users who try to rely on the standard,
> only to have to write the above conversion sequence and fail. Thus,
> providing a direct function with no intermediates results in a better
> Standard C experience.
> 
>      A minor but still helpful secondary motivation is in giving
> people on certain long-standing platforms a way out. By definition,
> UTF16 does not work with wchar_t, so I am explicitly told that using
> wchar_t on a platform like Windows, where it is UCS-2 (the
> non-multi-unit version of UTF-16 that was deprecated a while ago), is
> wrong with the Standard Library if I want real Unicode support.
> Library developers tell me to rely on platform-specific APIs. Advice
> such as "use MultiByteToWideChar", "use ICU", or "use this
> AIX-specific function" makes it much less of a Standard way to handle
> text: hence, the paper to the WG14 C Committee. The restartable
> versions of the single-character functions and the bulk conversion
> functions give implementations locked to behaving like the deprecated
> UCS-2 16-bit single-unit encoding a way out, and also allow us to
> have lossless data conversion.
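
For reference, the two-step route described in the quoted text above (c16rtomb into multibyte data, then mbrtowc) can be sketched roughly as follows. This is only an illustration under stated assumptions: the helper name c16s_to_wcs_via_mb is hypothetical and is not part of musl, the C standard, or the WG14 proposal. It shows where text can be "lost in the middle": c16rtomb fails as soon as the current locale's multibyte encoding cannot represent a character.

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>

/* Hypothetical helper (not a musl or ISO C interface): converts n UTF-16
 * code units to wide characters by going through the current locale's
 * multibyte encoding. Returns the number of wide characters written, or
 * (size_t)-1 if a character is not representable in the locale, the input
 * is malformed, or dst is too small. */
size_t c16s_to_wcs_via_mb(wchar_t *dst, size_t dstlen,
                          const char16_t *src, size_t n)
{
    mbstate_t enc, dec;
    size_t out = 0;

    memset(&enc, 0, sizeof enc);
    memset(&dec, 0, sizeof dec);

    for (size_t i = 0; i < n; i++) {
        char buf[MB_LEN_MAX];
        size_t len = c16rtomb(buf, src[i], &enc);
        if (len == (size_t)-1)
            return (size_t)-1;      /* lost in the multibyte middle step */

        /* len == 0 means a high surrogate was stored in enc; the bytes
         * appear once the matching low surrogate has been supplied. */
        size_t off = 0;
        while (off < len) {
            wchar_t wc;
            size_t r = mbrtowc(&wc, buf + off, len - off, &dec);
            if (r == (size_t)-1 || r == (size_t)-2)
                return (size_t)-1;
            if (out == dstlen)
                return (size_t)-1;  /* destination too small */
            dst[out++] = wc;
            if (r == 0)
                break;              /* decoded the null character */
            off += r;
        }
    }
    return out;
}
```

In the default "C" locale this is only expected to work for ASCII input. After setlocale(LC_ALL, "") in a UTF-8 locale the intermediate step should be lossless, while in a legacy locale the function returns (size_t)-1 for any character the encoding lacks, even when both char16_t and wchar_t could hold it.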

I don't think these interfaces give you an "out" in a way that's
fully conforming. The C model is that there's a set of characters
supported in the current locale, and each of them has one or more
multibyte representations (possibly involving shift states) and a
single wide character representation. Converting between UTF-16 or
UTF-32 and wchar_t outside the scope of characters that exist in the
current locale isn't presently a meaningful concept, and wouldn't
enable you to get meaningful results from wctype.h functions, etc.
(Would you propose having a second set of such functions for char32_t
to handle that? Really it sounds like what you want is an out to
deprecate wchar_t and use char32_t in its place, which wouldn't be a
bad idea...)

Solving these problems for implementations burdened by a legacy *wrong
choice* of definition of wchar_t is not possible by adding more
interfaces alone; it requires a lot of changes to the underlying
abstract model of what a character is in C. I'm not really in favor of
such changes. They complicate and burden existing working
implementations for the sake of ones that made bad choices. Windows in
particular *can* and *should* fix wchar_t to be 32-bit. The Windows
API uses WCHAR, not wchar_t, anyway, so a change in wchar_t is really
not a big deal for interface compatibility. Windows also has
conformance problems, like wprintf treating %s/%ls incorrectly, that
require breaking changes to fix. Good stdlib implementations on
Windows already fix these things.

>      This reasoning might be a little bit "overdone" for libraries
> like musl and glibc who got wchar_t right (thank you!), but part of
> standardizing these things means I have to account for implementations
> that have been around longer than I have been alive. :) Does that make
> sense?
> 
> > What situation do you envision where the proposed functions let you
> > reliably do something that's not already possible?
> 
>      My understanding is that libraries such as musl are "blessed" as
> distributions of the Standard Library, and that they can access system
> information that makes it possible for them to utilize what the
> current "wchar_t encoding" is in a way normal, regular developers
> cannot. Specifically, in the generic external implementation I have
> been working on, I have a number of #ifdefs to check for, say, IBM
> machines, then check if they are specifically under zh/tw or even jp
> locales, because they deploy a wchar_t in these scenarios that is
> neither UTF16 nor UTF32 (but instead a flavor of one of the GB
> encodings and Japanese encodings); otherwise, IBM uses UTF16/UCS-2 for
> wchar_t in i686 and UTF-32 for wchar_t in x86_64 for certain machines.
> I also check for what happens on Windows under various settings as
> well. Doing this as an external library is hard, because there is no
> way I can reliably control those knobs, whereas a Standard Library
> distribution would have access to that information (since it is
> providing such functions already).

The __STDC_ISO_10646__ macro is the way to determine that the encoding
of wchar_t is Unicode (or some subset if WCHAR_MAX doesn't admit the
full range). Otherwise it's not something you can meaningfully work
with except as an abstract number, but in that case you just want to
avoid it as much as possible and convert directly between multibyte
characters and char16_t/char32_t. I don't see how converting directly
between wchar_t and char16_t/char32_t is more useful, even if it is a
prettier factorization of the code.
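
As a minimal sketch of the check described above, the following classifies wchar_t using only __STDC_ISO_10646__ and WCHAR_MAX. The enum and the function name classify_wchar are hypothetical names chosen for illustration; they are not part of any library.

```c
#include <stdint.h>
#include <wchar.h>

/* Hypothetical classification of what a wchar_t value means. */
enum wchar_kind {
    WCHAR_OPAQUE,          /* no Unicode guarantee: treat as an abstract number */
    WCHAR_UNICODE_SUBSET,  /* Unicode values, but WCHAR_MAX is below 0x10FFFF */
    WCHAR_UNICODE_FULL     /* Unicode values covering the full code space */
};

enum wchar_kind classify_wchar(void)
{
#ifdef __STDC_ISO_10646__
    /* wchar_t values are ISO 10646 code points; check the usable range. */
    return WCHAR_MAX >= 0x10FFFF ? WCHAR_UNICODE_FULL : WCHAR_UNICODE_SUBSET;
#else
    /* Encoding is implementation-defined and possibly locale-dependent. */
    return WCHAR_OPAQUE;
#endif
}
```

On implementations that land in the WCHAR_OPAQUE case, the advice above applies: avoid interpreting wchar_t values and convert between multibyte characters and char16_t/char32_t instead.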

A far more useful thing to know than wchar_t encoding is the multibyte
encoding. POSIX gives you this in nl_langinfo(CODESET) but plain C has
no equivalent. I'd actually like to see WG14 adopt this into plain C.
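
For illustration, querying the multibyte encoding on a POSIX system might look like the sketch below. The helper name locale_is_utf8 is hypothetical, and the comparison against the literal "UTF-8" is an assumption: codeset names are not standardized, and some systems may spell it differently (for example "utf8").

```c
#include <langinfo.h>
#include <string.h>

/* Hypothetical helper: returns 1 if the current locale reports UTF-8 as
 * its multibyte encoding, 0 otherwise. Assumes the codeset is spelled
 * exactly "UTF-8", which POSIX does not guarantee. */
int locale_is_utf8(void)
{
    const char *codeset = nl_langinfo(CODESET);
    return strcmp(codeset, "UTF-8") == 0;
}
```

The result reflects whatever locale is currently in effect, so a program would normally call setlocale(LC_ALL, "") before asking.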

> > >      While I have a basic implementation, I would like to use some
> > > processor and compiler intrinsics to make it faster and make sure my
> > > first contribution meets both quality and speed standards for a C
> > > library.
> > >
> > >      Is there a place in the codebase I can look to for guidance on
> > > how to handle intrinsics properly within musl libc? If there is
> > > already infrastructure and common idioms in place, I would rather use
> > > that than starting to spin up my own.
> >
> > I'm not sure what you mean by intrinsics or why you're looking for
> > them but I guess you're thinking of something as a performance
> > optimization? musl favors having code in straight simple C except when
> > there's a strong reason (known bottleneck in existing real-world
> > software -- things like memcpy, strlen, etc.) to do otherwise. The
> > existing mb/wc code is slightly "vectorized" (see mbsrtowcs) but doing
> > so was probably a mistake. The motivation came along with one of the
> > early motivations for musl: not making UTF-8 a major performance
> > regression like it was in glibc. But it turned out the bigger issue
> > was the performance of character-at-a-time and byte-at-a-time
> > conversions, not bulk conversion.
> 
>      My experience so far is that the character-at-a-time functions
> can cause severe performance penalties for external users, especially
> if the library is dynamically linked.

On musl (where I'm familiar with performance properties),
byte-at-a-time conversion is roughly half the speed of bulk, which
looks big but is diminishingly so if you're actually doing something
with the result (just converting to wchar_t for its own sake is not
very useful). Character-at-a-time is probably somewhat less slow than
byte-at-a-time. When I wrote this I put in heavy effort to make
byte/character-at-a-time not horribly slow, because it's normally the
natural programming model. Wide character strings are not an idiomatic
type to work with in C.
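
To make the two programming models compared above concrete, here is a rough sketch of character-at-a-time decoding with mbrtowc next to bulk decoding with mbsrtowcs. The function names decode_per_char and decode_bulk are hypothetical, and the sketch says nothing about relative speed; it only shows that both forms produce the same wide characters.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Character-at-a-time: one mbrtowc call per decoded character.
 * Returns the number of wide characters written (no terminator is
 * stored), or (size_t)-1 on an invalid or incomplete sequence. */
size_t decode_per_char(wchar_t *dst, size_t dstlen, const char *src)
{
    mbstate_t st;
    size_t n = strlen(src), out = 0;

    memset(&st, 0, sizeof st);
    while (n > 0 && out < dstlen) {
        size_t r = mbrtowc(&dst[out], src, n, &st);
        if (r == (size_t)-1 || r == (size_t)-2)
            return (size_t)-1;
        if (r == 0)
            break;          /* null character: end of string */
        src += r;
        n -= r;
        out++;
    }
    return out;
}

/* Bulk: a single mbsrtowcs call converts the whole string. Returns the
 * number of wide characters written, not counting the terminator (which
 * is stored only if it fits), or (size_t)-1 on an invalid sequence. */
size_t decode_bulk(wchar_t *dst, size_t dstlen, const char *src)
{
    mbstate_t st;

    memset(&st, 0, sizeof st);
    return mbsrtowcs(dst, &src, dstlen, &st);
}
```

The per-character loop is the form that interleaves naturally with other processing, since each decoded character can be acted on as soon as it is available.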

Rich