musl - multibyte performance findings

Follow @Openwall on Twitter for new release announcements and other news

[<prev] [next>] [thread-next>] [day] [month] [year] [list]

Message-ID: <20130406052121.GA20915@brightrain.aerifal.cx>
Date: Sat, 6 Apr 2013 01:21:21 -0400
From: Rich Felker <dalias@...ifal.cx>
To: musl@...ts.openwall.com
Subject: multibyte performance findings

Hi all,

I've been examining performance in the multibyte conversion functions
(as part of the POSIX locale controversy), and have some interesting
findings so far:

1. Performance of mbrtowc seems to be very sensitive to the compiler's
code generation. Even adding code in untaken branches can drastically
slow down or speed up the overall runtime. In one case, adding a dummy
conditional to mimic locale-dependent encoding actually made the test
run faster. I think this means that before we can draw any conclusions
we need to figure out what's causing the compiler to behave to
wackily, and whether the code can be restructured in such a way that
its performance is less vulnerable to the whims of the compiler.

2. Implementing mbtowc (the old non-restartable function) as a wrapper
for mbrtowc is a bad idea. The interface contract of mbrtowc forces it
to be much slower than desirable; mbtowc's simpler interface can in
theory give much better performance, and based on my first rewrite of
mbtowc, the difference is big -- around 40% faster than the equivalent
mbrtowc calls, and over 50% faster than the wrapper-based mbtowc. This
means all musl-internal use of mbrtowc should probably be replaced by
mbtowc, or perhaps even an internal-use-only function with a better
interface.

3. A significant amount of time is "wasted" checking that the size n
of the input buffer is not exceeded when reading; removing the checks
speeds up mbtowc by 10%. As such, it might be desirable to break the
function into two cases: n>=4 (in which case no further length checks
are needed anywhere) and n<4 (in which case, each additional read
needs a check). Alternately, for mbtowc, perhaps there's a quick and
easy way to check the length against the state mask.

I'm probably going to go ahead and commit some changes that seem to be
clear wins in the above areas, but there's definitely room for
discussion. If anybody's interested in poking around at what's going
on with the optimizer or testing these functions heavily on other cpu
variants, let me know.

Rich

Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.