musl - Re: [ Guidance ] Potential New Routines; Requesting Help

Message-ID: <20191230173106.GI30412@brightrain.aerifal.cx>
Date: Mon, 30 Dec 2019 12:31:06 -0500
From: Rich Felker <dalias@...c.org>
To: JeanHeyd Meneide <phdofthehouse@...il.com>
Cc: musl@...ts.openwall.com
Subject: Re: [ Guidance ] Potential New Routines; Requesting Help

On Tue, Dec 24, 2019 at 06:06:50PM -0500, JeanHeyd Meneide wrote:
> Dear musl Maintainers and Contributors,
> 
>      I hope this e-mail finds you doing well this Holiday Season! I am
> interested in developing a few fast routines for text encoding for
> musl after the positive reception of a paper for the C Standard
> related to fast conversion routines:
> 
>      https://thephd.github.io/vendor/future_cxx/papers/source/C%20-%20Efficient%20Character%20Conversions.html

This is interesting, but I'm trying to understand the motivation.

If __STDC_ISO_10646__ is defined, wchar_t is UTF-32/UCS-4, and the
proposed functions are just the identity (for the c32 ones) and
UTF-16 to UTF-32 conversion (for the c16 ones).
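
To illustrate the point above: when wchar_t holds Unicode code points, the
c16 direction reduces to ordinary surrogate-pair decoding. The following is
a minimal sketch only. The name utf16_to_utf32 is hypothetical and is not
part of musl or of the proposal, and fixed-width integer types stand in for
char16_t/char32_t.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a musl or proposed API): decode n UTF-16 code
 * units into UTF-32 code points. dst must have room for n entries.
 * Returns the number of code points written, or (size_t)-1 if an
 * unpaired surrogate is found. */
static size_t utf16_to_utf32(const uint16_t *src, size_t n, uint32_t *dst)
{
    size_t out = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t c = src[i];
        if (c >= 0xD800 && c <= 0xDBFF && i + 1 < n
            && src[i + 1] >= 0xDC00 && src[i + 1] <= 0xDFFF) {
            /* High surrogate followed by low surrogate: combine them. */
            c = 0x10000 + ((c - 0xD800) << 10) + (src[i + 1] - 0xDC00u);
            i++;
        } else if (c >= 0xD800 && c <= 0xDFFF) {
            /* Lone surrogate: not valid UTF-16. */
            return (size_t)-1;
        }
        dst[out++] = c;
    }
    return out;
}
```

The c32 direction needs no code at all under this assumption, since each
char32_t value is already the wchar_t value.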

If it's not defined, you have the same problem as the current mb/cNN
functions: there's no reason to believe arbitrary Unicode characters
can round-trip through wchar_t any better than they can through
multibyte characters. In fact on such implementations it's likely that
wchar_t meanings are locale-dependent and just a remapping of the
byte/multibyte characters.

What situation do you envision where the proposed functions let you
reliably do something that's not already possible?

>      While I have a basic implementation, I would like to use some
> processor and compiler intrinsics to make it faster and make sure my
> first contribution meets both quality and speed standards for a C
> library.
> 
>      Is there a place in the codebase I can look to for guidance on
> how to handle intrinsics properly within musl libc? If there is
> already infrastructure and common idioms in place, I would rather use
> that than start spinning up my own.

I'm not sure what you mean by intrinsics or why you're looking for
them, but I guess you're thinking of them as a performance
optimization? musl favors having code in straight simple C except when
there's a strong reason (known bottleneck in existing real-world
software -- things like memcpy, strlen, etc.) to do otherwise. The
existing mb/wc code is slightly "vectorized" (see mbsrtowcs) but doing
so was probably a mistake. The motivation came along with one of the
early motivations for musl: not making UTF-8 a major performance
regression like it was in glibc. But it turned out the bigger issue
was the performance of character-at-a-time and byte-at-a-time
conversions, not bulk conversion.
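
For readers unfamiliar with the distinction, the two call patterns look
roughly like this. It is a sketch using only the standard mbrtowc and
mbsrtowcs interfaces; the wrapper names decode_per_char and decode_bulk are
hypothetical, and it makes no claim about actual timings.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Character-at-a-time: one mbrtowc call per decoded character.
 * Returns the number of wide characters written, or (size_t)-1 on an
 * invalid or incomplete sequence. */
static size_t decode_per_char(const char *s, wchar_t *dst, size_t max)
{
    mbstate_t st;
    memset(&st, 0, sizeof st);
    size_t n = strlen(s), out = 0;
    while (n > 0 && out < max) {
        size_t r = mbrtowc(&dst[out], s, n, &st);
        if (r == (size_t)-1 || r == (size_t)-2) return (size_t)-1;
        if (r == 0) break; /* decoded a null character */
        s += r;
        n -= r;
        out++;
    }
    return out;
}

/* Bulk: a single mbsrtowcs call converts the whole string. */
static size_t decode_bulk(const char *s, wchar_t *dst, size_t max)
{
    mbstate_t st;
    memset(&st, 0, sizeof st);
    const char *p = s;
    return mbsrtowcs(dst, &p, max, &st);
}
```

In the first pattern the per-call overhead is paid once per character,
which is the cost described above as the bigger issue.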

If we do adopt these functions, the right way to do it would be to use
them to refactor the existing c16/c32 functions. For example, the bulk
of c16rtomb would become c16rtowc, and c16rtomb would be replaced with
a call to c16rtowc followed by wctomb. The string ones can all be
simple loop wrappers.
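
A rough sketch of that layering follows. Everything here is an assumption
for illustration: sketch_c16rtowc and sketch_c16rtomb are hypothetical
stand-ins, not musl code and not the proposed signatures; struct c16_state
is an invented state type; wchar_t is assumed to hold Unicode code points;
and wcrtomb with a fresh state is used in place of wctomb.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <wchar.h>

/* Hypothetical conversion state: a pending high surrogate, or 0. */
struct c16_state { uint16_t pending_high; };

/* Hypothetical stand-in for the proposed c16rtowc.
 * Returns 1 if *wc was written, 0 if a high surrogate was stored and
 * more input is needed, (size_t)-1 with errno = EILSEQ on error. */
static size_t sketch_c16rtowc(wchar_t *wc, uint16_t c16, struct c16_state *st)
{
    if (st->pending_high) {
        if (c16 < 0xDC00 || c16 > 0xDFFF) {
            st->pending_high = 0;
            errno = EILSEQ;
            return (size_t)-1;
        }
        *wc = (wchar_t)(0x10000
            + ((uint32_t)(st->pending_high - 0xD800) << 10)
            + (uint32_t)(c16 - 0xDC00));
        st->pending_high = 0;
        return 1;
    }
    if (c16 >= 0xD800 && c16 <= 0xDBFF) {
        st->pending_high = c16;
        return 0;
    }
    if (c16 >= 0xDC00 && c16 <= 0xDFFF) {
        errno = EILSEQ;
        return (size_t)-1;
    }
    *wc = (wchar_t)c16;
    return 1;
}

/* c16rtomb-like wrapper: wide conversion first, then wide to multibyte.
 * Returns the number of bytes written to s (0 while a pair is pending). */
static size_t sketch_c16rtomb(char *s, uint16_t c16, struct c16_state *st)
{
    wchar_t wc;
    size_t r = sketch_c16rtowc(&wc, c16, st);
    if (r != 1) return r;
    mbstate_t mbs = {0};
    return wcrtomb(s, wc, &mbs);
}
```

A string variant would then be a loop that feeds code units to
sketch_c16rtomb one at a time and advances the output pointer by the
returned byte count.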

Rich
