musl - Re: [PATCH #2] Properly simplified nextafter()

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20210815154843.GH13220@brightrain.aerifal.cx>
Date: Sun, 15 Aug 2021 11:48:44 -0400
From: Rich Felker <dalias@...c.org>
To: Stefan Kanthak <stefan.kanthak@...go.de>
Cc: Szabolcs Nagy <nsz@...t70.net>, musl@...ts.openwall.com
Subject: Re: [PATCH #2] Properly simplified nextafter()

On Sun, Aug 15, 2021 at 05:19:05PM +0200, Stefan Kanthak wrote:
> Szabolcs Nagy <nsz@...t70.net> wrote:
> 
> > * Stefan Kanthak <stefan.kanthak@...go.de> [2021-08-15 09:04:55 +0200]:
> >> Szabolcs Nagy <nsz@...t70.net> wrote:
> >>> you should benchmark, but the second best is to look
> >>> at the longest dependency chain in the hot path and
> >>> add up the instruction latencies.
> >> 
> >> 1 billion calls to nextafter(), with random from, and to either 0 or +INF:
> >> run 1 against glibc,                         8.58 ns/call
> >> run 2 against musl original,                 3.59
> >> run 3 against musl patched,                  0.52
> >> run 4 the pure floating-point variant from   0.72
> >>       my initial post in this thread,
> >> run 5 the assembly variant I posted.         0.28 ns/call
> >
> > thanks for the numbers. it's not the best measurment
> 
> IF YOU DON'T LIKE IT, PERFORM YOUR OWN MEASUREMENT!

The burden of performing a meaningful measurement is on the party who
says there's something that needs to be changed.

> > but shows some interesting effects.
> 
> It clearly shows that musl's current implementation SUCKS, at least
> on AMD64.

Hardly. According to you it's faster than glibc, and looks
sufficiently fast never to be a bottleneck.

> >> PS: I cheated a very tiny little bit: the isnan() macro of musl patched is
> >> 
> >> #ifdef PATCH
> >> #define isnan(x) ( \
> >> sizeof(x) == sizeof(float) ? (__FLOAT_BITS(x) << 1) > 0xff00000U : \
> >> sizeof(x) == sizeof(double) ? (__DOUBLE_BITS(x) << 1) > 0xffe0000000000000ULL : \
> >> __fpclassifyl(x) == FP_NAN)
> >> #else
> >> #define isnan(x) ( \
> >> sizeof(x) == sizeof(float) ? (__FLOAT_BITS(x) & 0x7fffffff) > 0x7f800000 : \
> >> sizeof(x) == sizeof(double) ? (__DOUBLE_BITS(x) & -1ULL>>1) > 0x7ffULL<<52 : \
> >> __fpclassifyl(x) == FP_NAN)
> >> #endif // PATCH
> >
> > i think on x86 this only changes an and to an add
> > (or nothing at all if the compiler is smart)
> 
> BETTER THINK TWICE: where does the mask needed for the and come from?
> Does it need an extra register?
> How do you (for example) build it on ARM?
> 
> > if this is measurable that's an uarch issue of your cpu.
> 
> ARGH: it's not the and that makes the difference!
> 
> JFTR: movabs $0x7ff0000000000000, %r*x is a 10 byte instruction
>       I recommend to read Intel's and AMD's processor optimisation
>       manuals and learn just a little bit!

If you have a general reason (not specific to specific
microarchitectural considerartions) for why one form is preferred,
please state that from the beginning. I don't entirely understand your
argument here since in both the original version and yours, there's a
value on the RHS of the > operator that's in some sense nontrivial to
generate.

Ideally the compiler would be able to emit whichever form is preferred
for the target, since there's a clear transformation that can be made
either direction for this kind of thing. But since that's presently
not the case, if there's a version that can be expected, based on some
reasoning not just "what GCC happens to do", to be faster on most
targets, we should use that.

Rich
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.