musl - Re: x86[_64] memset and rep stos

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <20150225172620.GA1414@brightrain.aerifal.cx>
Date: Wed, 25 Feb 2015 12:26:20 -0500
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Re: x86[_64] memset and rep stos

On Wed, Feb 25, 2015 at 10:20:20AM +0100, Szabolcs Nagy wrote:
> * ?????? <torshie@...il.com> [2015-02-25 15:54:31 +0800]:
> > I'm not an expert on micro optimization, but why not use a dynamic
> > routine selection system which would select the optimal routine for a
> > given CPU during program initialization. The routine selection
> > algorithm could simply be a predefined static table look up.
> > IMO, only very small number of functions (like memset, memcpy) would
> > benefit from such a system, so no code size overhead to worry about.
> 
> my guess is
> 
> - for static linking it adds at least an extra indirection
>   and these functions often used with small input

Yes. Actually the extra indirection seems not to be measurable on
modern cpus but it may matter on older cpus, and older cpus are the
only reason to want runtime selection since the optimal strategy is
not changing for modern ones.

> - code size overhead: now you have to include all possible
>   versions in libc.so

Or worse, in a static binary.

> - for dynamic linking there is a load time dispatch mechanism:
>   STT_GNU_IFUNC but it's broken due to lack of specs

It also wouldn't get resolved before libc.so/ld.so needed to call
these functions.

> - maintainance burden, hard to test

This is the main one -- unbounded growth of maintenance and debugging
burden -- how would you test N different versions of the same code
without either having a test system for each cpu model, or extra hacks
to force a particular version of the code to get used even when it
mismatches the host cpu model? Who is responsible for ensuring that
the "optimized" variants for old cpus are actually still faster than
all the new variants? Etc. These issues are not just theoretical;
they're a real world problem that's plagued glibc even though they
have a much bigger team maintaining it. And uclibc basically fell
apart (IMO) because of inability to test and maintain all the
combinations of options (although it was other types of options, not
optimized asm variants).
      
> - selecting the right algorithm at runtime is not easy

Well, it involves the same tedious (but sometimes fun) work Denys and
I have been doing, except doing it with lots of different cpu models
and code variants (i.e. much more work).

> but i guess eventually when more ppl use musl it will make sense
> to add more target specific optimizations

Target-specific, yes, at least for targets that are being used in
places where there are measurable bottlenecks. But I'd still rather
avoid code variants for cpu models within the same ISA. If it's
looking like that kind of thing is needed, perhaps we should try to
model the relevant functions with "GNU C" (with portable fallback of
course, as always) and have the compiler generate the appropriate
variants (i.e. outsourcing trust that the generated code is actually
correct to the compiler) rather than writing multiple asm versions.

Rich
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.