Date: Sun, 15 Aug 2021 11:48:44 -0400
From: Rich Felker <dalias@...c.org>
To: Stefan Kanthak <stefan.kanthak@...go.de>
Cc: Szabolcs Nagy <nsz@...t70.net>, musl@...ts.openwall.com
Subject: Re: [PATCH #2] Properly simplified nextafter()

On Sun, Aug 15, 2021 at 05:19:05PM +0200, Stefan Kanthak wrote:
> Szabolcs Nagy <nsz@...t70.net> wrote:
>
> > * Stefan Kanthak <stefan.kanthak@...go.de> [2021-08-15 09:04:55 +0200]:
> >> Szabolcs Nagy <nsz@...t70.net> wrote:
> >>> you should benchmark, but the second best is to look
> >>> at the longest dependency chain in the hot path and
> >>> add up the instruction latencies.
> >>
> >> 1 billion calls to nextafter(), with random from, and to either 0 or +INF:
> >>
> >> run 1  against glibc,                         8.58 ns/call
> >> run 2  against musl original,                 3.59
> >> run 3  against musl patched,                  0.52
> >> run 4  the pure floating-point variant from
> >>        my initial post in this thread,        0.72
> >> run 5  the assembly variant I posted.         0.28 ns/call
>
> > thanks for the numbers. it's not the best measurement
>
> IF YOU DON'T LIKE IT, PERFORM YOUR OWN MEASUREMENT!

The burden of performing a meaningful measurement is on the party who
says there's something that needs to be changed.

> > but shows some interesting effects.
>
> It clearly shows that musl's current implementation SUCKS, at least
> on AMD64.

Hardly. According to you it's faster than glibc, and looks
sufficiently fast never to be a bottleneck.

> >> PS: I cheated a very tiny little bit: the isnan() macro of musl patched is
> >>
> >> #ifdef PATCH
> >> #define isnan(x) ( \
> >>     sizeof(x) == sizeof(float) ? (__FLOAT_BITS(x) << 1) > 0xff000000U : \
> >>     sizeof(x) == sizeof(double) ? (__DOUBLE_BITS(x) << 1) > 0xffe0000000000000ULL : \
> >>     __fpclassifyl(x) == FP_NAN)
> >> #else
> >> #define isnan(x) ( \
> >>     sizeof(x) == sizeof(float) ? (__FLOAT_BITS(x) & 0x7fffffff) > 0x7f800000 : \
> >>     sizeof(x) == sizeof(double) ? (__DOUBLE_BITS(x) & -1ULL>>1) > 0x7ffULL<<52 : \
> >>     __fpclassifyl(x) == FP_NAN)
> >> #endif // PATCH
>
> > i think on x86 this only changes an and to an add
> > (or nothing at all if the compiler is smart)
>
> BETTER THINK TWICE: where does the mask needed for the and come from?
> Does it need an extra register?
> How do you (for example) build it on ARM?
>
> > if this is measurable that's an uarch issue of your cpu.
>
> ARGH: it's not the and that makes the difference!
>
> JFTR: movabs $0x7ff0000000000000, %r*x is a 10-byte instruction.
>       I recommend reading Intel's and AMD's processor optimisation
>       manuals and learning just a little bit!

If you have a general reason (not tied to particular
microarchitectural considerations) for why one form is preferred,
please state that from the beginning.

I don't entirely understand your argument here, since in both the
original version and yours, there's a value on the RHS of the >
operator that's in some sense nontrivial to generate.

Ideally the compiler would be able to emit whichever form is
preferred for the target, since there's a clear transformation that
can be made in either direction for this kind of thing. But since
that's presently not the case, if there's a version that can be
expected, based on some reasoning not just "what GCC happens to do",
to be faster on most targets, we should use that.

Rich