Date: Thu, 29 Sep 2016 11:23:54 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Re: Model specific optimizations?

On Thu, Sep 29, 2016 at 04:21:26PM +0200, Markus Wichmann wrote:
> Hi there,
> 
> I wanted to ask if there is any wish for the near future to support
> model-specific optimizations. What I mean by that is multiple
> implementations of the same function, where the best implementation is
> decided at run-time.

musl's general approach to this is to use non-mandatory ISA extensions
only when compiled with the right -march to assume their availability.

I say "non-mandatory" because some extensions must be used when
supported in order to get correct behavior. For example, if a baseline
ISA lacks atomics (ARM, SH) and emulates them with kernel help that
only works on UP machines (SH), then code running on models that
support SMP, even if it was compiled only for the baseline ISA, _must_
detect availability of the atomics and use them. Otherwise it would
not properly synchronize with other cpus.

Another example is setjmp/longjmp handling of floating point registers
on softfloat ARM ABIs. Per the ABI, if the registers exist, some of
them are defined as call-saved, so setjmp/longjmp have to save and
restore them even when the ABI that libc was built for doesn't use
them itself.

It's possible we could also apply runtime detection on a case-by-case
basis for things that are not mandatory but performance-critical;
however, I think we would want to see convincing evidence that the
performance gains would be worth the complexity/maintenance costs.

For most cases, a better solution is probably just to build libc.so
optimized for your machine with the appropriate -march.

> One simple example would be PowerPC's fsqrt instruction. The PowerPC
> Book 1 defines it as optional and provides no way to know whether the
> currently running processor supports this instruction, other than
> executing it and seeing if you get a SIGILL.
> 
> A cursory DuckDuckGo search revealed that Apple uses the instruction as
> sqrt implementation if it detects the CPU capability for that, however
> it only detects that capability by checking the PVR for known-good bit
> patterns (Currently, the only known PowerPC cores to support this
> instruction are the 970 and 970FX, which have a version field of 0x39
> and 0x3c, respectively). x86 and -derived architectures at least have
> the cpuid instruction to check for some features, and admittedly,
> there's a lot of defined bits. However, glibc's ifunc-initialization
> function (which selects the implementation) also does a lot of work
> finding out the precise make and model of the CPU to set some more
> flags.
> 
> The reason I ask is that lots of ISAs define optional parts that grow
> in popularity more and more until they're seen in all current
> practical implementations. Like how x87 started out as a separate
> device but has been a fixed part of x86 since the later days of the
> 486. Same with MMX, SSE, SSE2. None of these are mandatory by the ABI,
> but available in all practical implementations. And musl is never going
> to be able to utilize that in its current form. Oh, alright, the
> compiler might support it, but that's different.

For vector instructions, generation by the compiler is almost always
the only way we want to be using them. Hand-written vector code is
a huge liability in terms of maintenance, readability, bug-surface, etc.
The aim in musl is to have no asm beyond what's absolutely necessary
to run and to make performance-critical functions (mainly memcpy and
the like) acceptably fast.

> I also suspect the
> fsqrt instruction will be available in more future PowerPC
> implementations.
> 
> If we were to go this route, the question is how to go about it. First
> the detection method: Stuff like cpuid or AT_HWCAP are pretty nice,
> because they allow for the detection of a feature, whereas version
> checking only allows one to find known-good implementations. The latter
> means there's a list of known-good values, and that list has to be kept
> up-to-date. However, the latter is also pretty much always possible,
> while the former isn't always available. The kernel doesn't check for
> fsqrt availability, for example.

What kind of version-checking? Not all systems even give you a way to
version-check.

 
> Then organization: Are we going the glibc route, which gathers all
> indirect functions in a single section and initializes all of the
> pointers at startup (__libc_init_main()), or do we put these checks
> separately in each function?

Branch based on hwcap or similar in the functions themselves.
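
As a rough sketch of what "branch in the function itself" could look
like, assuming a Linux target where getauxval(AT_HWCAP) is available:
the feature bit HWCAP_EXAMPLE_FSQRT, the names sqrt_by_hwcap and
use_hard_sqrt, and both sqrt bodies are made up for illustration. No
real hwcap bit for fsqrt is implied, and the Newton loop is only a
placeholder for a real software sqrt.

```c
#include <sys/auxv.h>   /* getauxval, AT_HWCAP (Linux) */

/* HYPOTHETICAL feature bit, for illustration only. It does not
 * correspond to a real AT_HWCAP flag on any architecture. */
#define HWCAP_EXAMPLE_FSQRT (1UL << 0)

/* Placeholder only: a few Newton steps, NOT a correctly rounded sqrt,
 * and adequate only for moderate inputs. */
static double soft_sqrt(double x)
{
    if (!(x > 0)) return x == 0 ? x : 0.0 / 0.0; /* 0 -> 0, else NaN */
    double r = x > 1 ? x : 1;
    for (int i = 0; i < 64; i++) r = 0.5 * (r + x / r);
    return r;
}

/* Stand-in for the hardware path. On a real target this would be the
 * inline asm version; here it reuses the placeholder so the sketch
 * builds and runs anywhere. */
static double hard_sqrt(double x)
{
    return soft_sqrt(x);
}

/* Selection logic kept separate so it can be checked with any mask. */
static int use_hard_sqrt(unsigned long hwcap)
{
    return (hwcap & HWCAP_EXAMPLE_FSQRT) != 0;
}

/* The branch lives in the function itself: no function pointer and no
 * startup-time initialization pass. */
double sqrt_by_hwcap(double x)
{
    if (use_hard_sqrt(getauxval(AT_HWCAP)))
        return hard_sqrt(x);
    return soft_sqrt(x);
}
```

Inside libc the hwcap word would more likely be read from a value
saved at startup than fetched with getauxval() on every call; that
detail is left out here.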

> To make a practical example, we could implement sqrt() for PowerPC like
> this:
> 
> static double soft_sqrt(double);
> static double hard_sqrt(double);
> static double init_sqrt(double);
> static double (*sqrtfn)(double) = init_sqrt;
> 
> double sqrt(double x) {
>     return sqrtfn(x);
> }
> 
> static double init_sqrt(double x) {
>     unsigned long pvr;
>     unsigned long ver;
>     asm ("mfpvr %0" : "=r"(pvr));
>     ver = (pvr >> 16) & 0xffff;
>     /* XXX: Add more values for cores with the fsqrt instruction here */
>     if (0
>         || ver == 0x39  /* PowerPC 970 */
>         || ver == 0x3c  /* PowerPC 970FX */
>     )
>         sqrtfn = hard_sqrt;
>     else
>         sqrtfn = soft_sqrt;
> 
>     return sqrtfn(x);
> }

This code contains data races. In order to be safe under musl's memory
model, sqrtfn would have to be volatile and should probably be written
via a_cas_p. It also then has to have type void* and be cast to/from
function pointer type. See clock_gettime.c.
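
A minimal sketch of that pattern, using C11 <stdatomic.h> as a
portable stand-in for musl's internal a_cas_p (which is not available
outside libc). The names sqrt_lazy and cpu_has_fsqrt are hypothetical,
the detection is stubbed out, and both sqrt bodies are placeholders,
so this only illustrates the race-free pointer update.

```c
#include <stdatomic.h>

typedef double (*sqrt_fn)(double);

/* Placeholder only: a few Newton steps, NOT a correctly rounded sqrt. */
static double soft_sqrt(double x)
{
    if (!(x > 0)) return x == 0 ? x : 0.0 / 0.0; /* 0 -> 0, else NaN */
    double r = x > 1 ? x : 1;
    for (int i = 0; i < 64; i++) r = 0.5 * (r + x / r);
    return r;
}

/* Stand-in for the fsqrt path so the sketch builds on any machine. */
static double hard_sqrt(double x) { return soft_sqrt(x); }

/* HYPOTHETICAL detection hook; always reports "not available" here. */
static int cpu_has_fsqrt(void) { return 0; }

static double init_sqrt(double x);

/* Stored as void * and cast to/from the function pointer type, as
 * described above. All accesses go through atomic operations. */
static _Atomic(void *) sqrtfn = (void *)init_sqrt;

double sqrt_lazy(double x)
{
    sqrt_fn f = (sqrt_fn)atomic_load(&sqrtfn);
    return f(x);
}

static double init_sqrt(double x)
{
    void *expected = (void *)init_sqrt;
    sqrt_fn f = cpu_has_fsqrt() ? hard_sqrt : soft_sqrt;
    /* Several threads may get here at once. Each computes the same
     * answer; the CAS lets exactly one of them publish it, and the
     * others simply fail the exchange and move on. */
    atomic_compare_exchange_strong(&sqrtfn, &expected, (void *)f);
    return f(x);
}
```

The first call goes through init_sqrt and swaps the pointer; later
calls load the selected implementation directly.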

> static double hard_sqrt(double x) {
>     double r;
>     asm ("fsqrt %0, %1": "=d"(r) : "d"(x));
>     return r;
> }

For some archs, gas produces an error or tags the .o file as needing a
certain ISA level if you use an instruction that's not present in the
baseline ISA. I'm not sure if this is an issue here or not.

> #define sqrt soft_sqrt
> #include "../sqrt.c"
> 
> 
> Problem with this is: The same thing would have to be repeated for
> sqrtf(), the same list of known values would have to be maintained
> twice, although we could make it a real list (an array, I mean), and get
> rid of that issue. But it does add quite a bit of code, and the overhead
> of an indirect function call, and at the moment isn't going to be useful
> to anyone but a few people.
> 
> Also, the inclusion here is a hack. But I couldn't think of a better
> way.

I think it's the #define sqrt soft_sqrt that's a hack. The inclusion
itself is okay and would be the right way to do this for sure if it
were just a compile-time check and not a runtime one.
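
For comparison, a sketch of the compile-time form. It assumes GCC's
PowerPC predefined macro _ARCH_PPCSQ, which as far as I know is
defined when the target is known to have fsqrt. The name sqrt_static
is made up, and the fallback body is a placeholder standing in for
what would really be an #include of the generic sqrt.c.

```c
#if defined(_ARCH_PPCSQ)

/* The compiler was told (via -mcpu or similar) that fsqrt exists, so
 * use it unconditionally. No runtime detection is needed. */
double sqrt_static(double x)
{
    double r;
    __asm__ ("fsqrt %0, %1" : "=d"(r) : "d"(x));
    return r;
}

#else

/* In libc this branch would just be: #include "../sqrt.c"
 * The body below is a placeholder so the sketch is self-contained:
 * a few Newton steps, NOT a correctly rounded sqrt. */
double sqrt_static(double x)
{
    if (!(x > 0)) return x == 0 ? x : 0.0 / 0.0; /* 0 -> 0, else NaN */
    double r = x > 1 ? x : 1;
    for (int i = 0; i < 64; i++) r = 0.5 * (r + x / r);
    return r;
}

#endif
```

With this form nothing has to be renamed with #define, since only one
definition of the function is ever compiled.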

Rich
