musl - Re: towlower performance

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20170527005950.GA1627@brightrain.aerifal.cx>
Date: Fri, 26 May 2017 20:59:50 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Re: towlower performance

On Fri, May 26, 2017 at 05:39:31PM -0700, Andre McCurdy wrote:
> On Thu, May 25, 2017 at 11:01 AM, maksis .
> <maksis@...enaline-network.com> wrote:
> > Hi,
> >
> > After switching my C++ app from a glibc-based build to a static build using
> > musl, I've noticed significant differences in performance with certain
> > loading operations, which seems to be related to using std::map with a
> > case-insensitive UTF-8 sorting algorithm. For example, parsing a XML file
> > with a flat listing of ~350k directory items and mapping them by name takes
> > about 3 seconds with glibc vs 13 seconds with musl. After modifying the sort
> > algorithm by removing all calls to towlower, the loading time with musl
> > drops to 3.5 seconds.
> >
> > I put together a C++ benchmark with the same sorting algorithm, which seems
> > to produce similar results (GCC 6.3, 64 bit, -O3):
> >
> > https://gist.github.com/maksis/92ad04f525d69043283350675d04f160
> >
> > glibc: ~2 seconds (Arch Linux)
> >
> > musl: ~7 seconds (Alpine Linux 3.6.0)
> >
> > What might be causing the difference?
> 
> I'm not sure if it maps directly to your results, but when building a
> gcc based musl toolchain, libstdc++ gets configured to use the
> "generic" OS specific include dir, which seems to contain some less
> than optimal ctype code. e.g. ctype_inline.h comes with the following
> warning:
> 
>   // The following definitions are portable, but insanely slow. If one
>   // cares at all about performance, then specialized ctype
>   // functionality should be added for the native os in question: see
>   // the config/os/bits/ctype_*.h files.
> 
>   https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=libstdc%2B%2B-v3/config/os/generic/ctype_inline.h

That's a bit of an overstatement. While it's possible for some
operations to be slightly to moderately faster with a
pokes-at-libc-internals implementation of c++ locale/ctype
functionality, the portable code is not "insanely" slow. For wchar
functionality I don't think it's even slower at all; as far as I know
glibc only provides function-bypass tables for the 8bit char is*
functions, and for the past couple decades they haven't even fully
bypassed function calls since a function call is used to obtain the
thread-local locale tables.

In any case the portable-interface-based implementation (generic) is
the correct thing to do. musl does not and won't provide backdoors to
bypass that.

Thankfully, though, I think the issue maksis reported is more a matter
of the towlower/towupper implementation being oriented towards size
(we have all the Unicode case mappings plus code to perform them in
about 1k!) rather than performance. A simple fix that would make this
case fast (maybe faster than glibc even) is short-circuiting the
common ascii case, but a more multilingual-friendly solution might be
using some sort of O(1) first-stage table to find the relevant parts
of the mapping tables instead of using linear searches and multiple
special-case branches up front. I suspect we can make the code very
fast while keeping the size under 3k and probably less.

Rich
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.