Date: Mon, 30 Dec 2019 13:39:10 -0500
From: JeanHeyd Meneide <phdofthehouse@...il.com>
To: Rich Felker <dalias@...c.org>
Cc: musl@...ts.openwall.com
Subject: Re: [ Guidance ] Potential New Routines; Requesting Help

On Mon, Dec 30, 2019 at 12:31 PM Rich Felker <dalias@...c.org> wrote:
> > ...
> This is interesting, but I'm trying to understand the motivation.
>
> If __STDC_ISO_10646__ is defined, wchar_t is UTF-32/UCS-4, and the
> proposed functions are just the identity (for the c32 ones) and
> UTF-16/32 conversion.
>
> If it's not defined, you have the same problem as the current mb/cNN
> functions: there's no reason to believe arbitrary Unicode characters
> can round-trip through wchar_t any better than they can through
> multibyte characters. In fact on such implementations it's likely that
> wchar_t meanings are locale-dependent and just a remapping of the
> byte/multibyte characters.

     I'm sorry; I'll try to phrase it as best I can.

     The issue I and others have with the lack of cNNtowc is that, if
we are to write standards-compliant C, the only portable way to
transform, for example, char16_t data into wchar_t is:

     c16rtomb -> multibyte data -> mbrtowc

     The problem with this conversion sequence is that there are many
legacy multibyte encodings, and pivoting through them causes bugs on
many users' machines. Text representable in both char16_t and wchar_t
is lost in the middle because the intermediate encoding cannot carry
it, so we lose data going between wchar_t and char16_t. This has been
frustrating for a number of users who try to rely on the standard,
only to write the above conversion sequence and have it fail.
Providing a direct function with no intermediates thus results in a
better Standard C experience.
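
     To make the failure mode concrete, here is a minimal sketch of
the pivot (c16_to_wc_via_mb is just an illustrative helper, not a
proposed name):

#include <limits.h>   /* MB_LEN_MAX */
#include <uchar.h>    /* c16rtomb, char16_t */
#include <wchar.h>    /* mbrtowc, mbstate_t */

/* char16_t -> locale multibyte -> wchar_t.  If the locale's multibyte
 * encoding (e.g. a legacy codepage) cannot represent the character,
 * c16rtomb fails and the text is lost, even though both char16_t and
 * wchar_t could have carried it.  Returns 1 on success, 0 for a
 * leading surrogate (call again with the trailing half), -1 on loss. */
static int c16_to_wc_via_mb(wchar_t *out, char16_t c16, mbstate_t *st)
{
    char mb[MB_LEN_MAX];
    size_t n = c16rtomb(mb, c16, st);
    if (n == (size_t)-1)
        return -1;            /* unrepresentable in the pivot encoding */
    if (n == 0)
        return 0;             /* surrogate half stored in *st */
    mbstate_t st2 = {0};
    if (mbrtowc(out, mb, n, &st2) == (size_t)-1)
        return -1;            /* pivot bytes invalid for wchar_t */
    return 1;
}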

     A minor but still helpful secondary motivation is giving people
on certain long-standing platforms a way out. By definition, a
multi-unit encoding like UTF-16 cannot work with wchar_t, so on a
platform like e.g. Windows wchar_t is effectively UCS-2 (the
single-unit predecessor of UTF-16, deprecated a while ago), and I am
explicitly told that the Standard Library is the wrong tool there if
I want real Unicode support. Library developers tell me to rely on
platform-specific APIs instead. "Use MultiByteToWideChar", "use ICU",
or "use this AIX-specific function" makes it much less of a Standard
way to handle text: hence the paper to the WG14 C Committee. The
restartable single-character functions and the bulk conversion
functions give implementations locked into the deprecated, 16-bit
single-unit UCS-2 behavior a way out, and also allow lossless data
conversion.

     This reasoning might be a little "overdone" for libraries like
musl and glibc, which got wchar_t right (thank you!), but part of
standardizing these things is that I have to account for
implementations that have been around longer than I have been alive.
:) Does that make sense?

> What situation do you envision where the proposed functions let you
> reliably do something that's not already possible?

     My understanding is that libraries such as musl are "blessed" as
distributions of the Standard Library, and that they can access system
information that lets them determine the current "wchar_t encoding" in
a way regular developers cannot. Specifically, in the generic external
implementation I have been working on, I have a number of #ifdef
checks for, say, IBM machines, which then check whether they are
running under zh/tw or even jp locales, because in those scenarios
they deploy a wchar_t that is neither UTF-16 nor UTF-32 (but instead a
flavor of one of the GB encodings or the Japanese encodings);
otherwise, IBM uses UTF-16/UCS-2 for wchar_t on i686 and UTF-32 for
wchar_t on x86_64 on certain machines. I also check for what happens
on Windows under various settings. Doing this in an external library
is hard, because there is no way I can reliably control the relevant
knobs, whereas a Standard Library distribution has access to that
information (since it already provides such functions).
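
     As an illustration, the guessing an external library is reduced
to looks roughly like this (the macro names are real, but the ladder
and the setlocale() sniffing are an approximation, not my actual
code):

#include <locale.h>
#include <string.h>
#include <wchar.h>

enum wc_enc { WC_UTF32, WC_UTF16, WC_LOCALE_DEPENDENT, WC_UNKNOWN };

static enum wc_enc guess_wchar_encoding(void)
{
#if defined(__STDC_ISO_10646__)
    return WC_UTF32;          /* wchar_t holds ISO 10646 code points */
#elif defined(_WIN32)
    return WC_UTF16;          /* 16-bit wchar_t, historically UCS-2 */
#elif defined(_AIX) || defined(__MVS__)
    /* On some IBM platforms the wchar_t encoding follows the locale:
     * under zh/ja locales it may be a GB or Japanese flavor, not UCS
     * at all; runtime locale sniffing is the only hint available. */
    const char *loc = setlocale(LC_CTYPE, NULL);
    if (loc && (strstr(loc, "zh") || strstr(loc, "ja")))
        return WC_LOCALE_DEPENDENT;
    return sizeof(wchar_t) == 2 ? WC_UTF16 : WC_UTF32;
#else
    return WC_UNKNOWN;        /* no portable way to ask */
#endif
}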

     So, for example, musl -- being the C library -- controls how
wchar_t behaves (modulo compiler intervention) for its wide character
functions. Similarly, glibc would know what to do for its platforms,
IBM would know what to do for its platforms, and so on. Each
distribution would provide behavior in coordination with its platform.

    Is this incorrect? Am I assuming a level of standard library <->
vendor relation/cooperation that does not exist?

> >      While I have a basic implementation, I would like to use some
> > processor and compiler intrinsics to make it faster and make sure my
> > first contribution meets both quality and speed standards for a C
> > library.
> >
> >      Is there a place in the codebase I can look to for guidance on
> > how to handle intrinsics properly within musl libc? If there is
> > already infrastructure and common idioms in place, I would rather use
> > that than start spinning up my own.
>
> I'm not sure what you mean by intrinsics or why you're looking for
> them but I guess you're thinking of something as a performance
> optimization? musl favors having code in straight simple C except when
> there's a strong reason (known bottleneck in existing real-world
> software -- things like memcpy, strlen, etc.) to do otherwise. The
> existing mb/wc code is slightly "vectorized" (see mbsrtowcs) but doing
> so was probably a mistake. The motivation came along with one of the
> early motivations for musl: not making UTF-8 a major performance
> regression like it was in glibc. But it turned out the bigger issue
> was the performance of character-at-a-time and byte-at-a-time
> conversions, not bulk conversion.

     My experience so far is that the character-at-a-time functions
can impose severe performance penalties on external users, especially
when the library is dynamically linked. If the C standard provided
the bulk-conversion functions, performance would increase drastically
for users who want bulk conversion, because they would no longer have
to write a loop around a dynamically-resolved function call that
converts one character at a time. I am glad that musl has had a
similar experience, and I would like to make the bulk functions
available in musl too!
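
     The contrast, using only the existing functions, looks roughly
like this (a rough sketch of the caller's side, not code from my
implementation):

#include <wchar.h>

/* Character-at-a-time: one libc call (a PLT-indirected call when
 * dynamically linked) per multibyte character decoded. */
size_t decode_one_at_a_time(wchar_t *dst, const char *src, size_t n)
{
    mbstate_t st = {0};
    size_t out = 0;
    while (n) {
        size_t r = mbrtowc(&dst[out], src, n, &st);
        if (r == (size_t)-1 || r == (size_t)-2)
            return (size_t)-1;   /* invalid or truncated input */
        if (r == 0)
            break;               /* hit the terminator */
        src += r; n -= r; out++;
    }
    return out;
}

/* Bulk: a single call, inside which the implementation is free to
 * vectorize.  (Here n is the output capacity in wide characters,
 * per mbsrtowcs's interface.) */
size_t decode_bulk(wchar_t *dst, const char *src, size_t n)
{
    mbstate_t st = {0};
    return mbsrtowcs(dst, &src, n, &st);
}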

     I asked about intrinsics because I have some hand-vectorized
optimizations for some of the bulk cases. I will be more than happy
to contribute regular, readable plain C first, though, and revisit
those functions later if vectorization with SIMD and other
platform-specific instructions proves worthwhile. My initial hunch is
that it will, but I am more than happy to focus on correctness first
and extreme performance (maybe) later.

> If we do adopt these functions, the right way to do it would be using
> them to refactor the existing c16/c32 functions. Basically, for
> example, the bulk of c16rtomb would become c16rtowc, and c16rtomb
> would be replaced with a call to c16rtowc followed by wctomb. And the
> string ones can all be simple loop wrappers.

     I would be more than happy to write the implementation that way!
Most of the wchar_t functions will be very easy, since musl and glibc
chose the right wchar_t. (Talking to other vendors is going to be a
much, much more difficult conversation...)
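
     Concretely, under __STDC_ISO_10646__ (as in musl) something like
the following would fall out.  The signature and return convention of
c16rtowc are my assumption of what the paper's function would look
like, and the static pending variable stands in for real mbstate_t
bookkeeping:

#include <uchar.h>
#include <wchar.h>

/* UTF-16 -> UCS-4: identity for the BMP, pairing for surrogates.
 * Returns 1 if *pwc was written, 0 if c16 was a leading surrogate
 * (stored for the next call), (size_t)-1 on an unpaired surrogate. */
static unsigned pending;

size_t c16rtowc(wchar_t *pwc, char16_t c16, mbstate_t *ps)
{
    (void)ps;                  /* real code would keep state in *ps */
    if (pending) {
        if (c16 < 0xdc00 || c16 > 0xdfff)
            return (size_t)-1; /* leading surrogate without trailer */
        *pwc = 0x10000 + ((pending - 0xd800) << 10) + (c16 - 0xdc00);
        pending = 0;
        return 1;
    }
    if (c16 >= 0xd800 && c16 <= 0xdbff) {
        pending = c16;         /* wait for the trailing half */
        return 0;
    }
    if (c16 >= 0xdc00 && c16 <= 0xdfff)
        return (size_t)-1;     /* stray trailing surrogate */
    *pwc = c16;                /* BMP: the identity conversion */
    return 1;
}

/* c16rtomb then reduces to c16rtowc followed by the (restartable)
 * wcrtomb, and the string functions become simple loops over this. */
size_t c16rtomb(char *s, char16_t c16, mbstate_t *ps)
{
    wchar_t wc;
    size_t r = c16rtowc(&wc, c16, ps);
    if (r == 0 || r == (size_t)-1)
        return r;              /* pending surrogate, or error */
    mbstate_t mbst = {0};
    return wcrtomb(s, wc, &mbst);
}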

Best Wishes,
JeanHeyd Meneide
