Date: Mon, 30 Dec 2019 12:31:06 -0500
From: Rich Felker <dalias@...c.org>
To: JeanHeyd Meneide <phdofthehouse@...il.com>
Cc: musl@...ts.openwall.com
Subject: Re: [ Guidance ] Potential New Routines; Requesting Help

On Tue, Dec 24, 2019 at 06:06:50PM -0500, JeanHeyd Meneide wrote:
> Dear musl Maintainers and Contributors,
> 
>      I hope this e-mail finds you doing well this Holiday Season! I am
> interested in developing a few fast routines for text encoding for
> musl after the positive reception of a paper for the C Standard
> related to fast conversion routines:
> 
>      https://thephd.github.io/vendor/future_cxx/papers/source/C%20-%20Efficient%20Character%20Conversions.html

This is interesting, but I'm trying to understand the motivation.

If __STDC_ISO_10646__ is defined, wchar_t is UTF-32/UCS-4, and the
proposed functions are just the identity (for the c32 ones) and
conversion between UTF-16 and UTF-32 (for the c16 ones).
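
To make that case concrete, here is a minimal sketch of what the c16
direction reduces to when wchar_t holds UTF-32 code points. The helper
name utf16_to_wc, its signature, and its return convention are
hypothetical; they come from neither musl nor the paper, and the code
assumes a 32-bit wchar_t.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <wchar.h>

/* Hypothetical helper (not from musl or the paper). Assumes wchar_t is
 * 32 bits wide and holds UTF-32, i.e. __STDC_ISO_10646__ is defined.
 * Decodes one code point from up to n UTF-16 code units.
 * Returns the number of units consumed (1 or 2), 0 if n == 0,
 * or -1 on an unpaired or truncated surrogate. */
static int utf16_to_wc(const uint16_t *in, size_t n, wchar_t *out)
{
    assert(in != NULL && out != NULL);
    if (n == 0) return 0;

    uint32_t hi = in[0];
    if (hi - 0xD800u >= 0x800u) {       /* not a surrogate: identity */
        *out = (wchar_t)hi;
        return 1;
    }
    if (hi >= 0xDC00u || n < 2)         /* lone low surrogate, or truncated */
        return -1;

    uint32_t lo = in[1];
    if (lo - 0xDC00u >= 0x400u)         /* high surrogate without a low one */
        return -1;

    /* Combine the two 10-bit halves and add the supplementary-plane offset. */
    *out = (wchar_t)(0x10000u + ((hi - 0xD800u) << 10) + (lo - 0xDC00u));
    return 2;
}
```

The c32 direction needs no code at all under the same assumption: the
value is simply copied into the wchar_t.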

If it's not defined, you have the same problem as the current mb/cNN
functions: there's no reason to believe arbitrary Unicode characters
can round-trip through wchar_t any better than they can through
multibyte characters. In fact on such implementations it's likely that
wchar_t meanings are locale-dependent and just a remapping of the
byte/multibyte characters.

What situation do you envision where the proposed functions let you
reliably do something that's not already possible?

>      While I have a basic implementation, I would like to use some
> processor and compiler intrinsics to make it faster and make sure my
> first contribution meets both quality and speed standards for a C
> library.
> 
>      Is there a place in the codebase I can look to for guidance on
> how to handle intrinsics properly within musl libc? If there is
> already infrastructure and common idioms in place, I would rather use
> that than start spinning up my own.

I'm not sure what you mean by intrinsics or why you're looking for
them, but I guess you're thinking of something as a performance
optimization? musl favors having code in straight simple C except when
there's a strong reason (known bottleneck in existing real-world
software -- things like memcpy, strlen, etc.) to do otherwise. The
existing mb/wc code is slightly "vectorized" (see mbsrtowcs) but doing
so was probably a mistake. The motivation came along with one of the
early motivations for musl: not making UTF-8 a major performance
regression like it was in glibc. But it turned out the bigger issue
was the performance of character-at-a-time and byte-at-a-time
conversions, not bulk conversion.

If we do adopt these functions, the right way to do it would be using
them to refactor the existing c16/c32 functions. Basically, for
example, the bulk of c16rtomb would become c16rtowc, and c16rtomb
would be replaced with a call to c16rtowc followed by wctomb. And the
string ones can all be simple loop wrappers.
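
Roughly, the refactoring could take the shape sketched below. This is
only an illustration under stated assumptions: sketch_c16rtowc stands in
for the proposed c16rtowc, whose real signature I have not checked, and
a plain unsigned "pending" value replaces the mbstate_t that a real
implementation would use. It also assumes wchar_t is UTF-32.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical stand-in for the proposed c16rtowc; name, signature and
 * return convention are assumptions. *pending holds a saved high
 * surrogate (0 if none) in place of mbstate_t.
 * Returns 1 if *wc was written, 0 if more input is needed,
 * (size_t)-1 on an invalid sequence. */
static size_t sketch_c16rtowc(wchar_t *wc, uint16_t c16, unsigned *pending)
{
    if (*pending) {                          /* expecting a low surrogate */
        unsigned hi = *pending;
        *pending = 0;
        if (c16 - 0xDC00u >= 0x400u) return (size_t)-1;
        *wc = (wchar_t)(0x10000u + ((hi - 0xD800u) << 10) + (c16 - 0xDC00u));
        return 1;
    }
    if (c16 - 0xD800u < 0x400u) {            /* high surrogate: save it */
        *pending = c16;
        return 0;
    }
    if (c16 - 0xDC00u < 0x400u)              /* lone low surrogate */
        return (size_t)-1;
    *wc = (wchar_t)c16;
    return 1;
}

/* The c16-to-multibyte function becomes a thin wrapper: decode to
 * wchar_t, then let wctomb do the locale-specific encoding.
 * s must have room for at least MB_LEN_MAX bytes. */
static size_t sketch_c16rtomb(char *s, uint16_t c16, unsigned *pending)
{
    wchar_t wc;
    size_t r = sketch_c16rtowc(&wc, c16, pending);
    if (r != 1) return r;                    /* 0: nothing to emit; -1: error */
    int len = wctomb(s, wc);
    return len < 0 ? (size_t)-1 : (size_t)len;
}
```

With this split, all the UTF-16 logic lives in one place, and the
multibyte side is whatever wctomb already does for the current locale.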

Rich
