musl - Re: [PATCH #2] Properly simplified nextafter()

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <4272846f-eb89-2856-af9-38571037a924@dereferenced.org>
Date: Sun, 15 Aug 2021 02:46:58 -0500 (CDT)
From: Ariadne Conill <ariadne@...eferenced.org>
To: musl@...ts.openwall.com
cc: Szabolcs Nagy <nsz@...t70.net>
Subject: Re: [PATCH #2] Properly simplified nextafter()

Hi,

On Sun, 15 Aug 2021, Stefan Kanthak wrote:

> Szabolcs Nagy <nsz@...t70.net> wrote:
>
>> * Stefan Kanthak <stefan.kanthak@...go.de> [2021-08-13 14:04:51 +0200]:
>>> Szabolcs Nagy <nsz@...t70.net> wrote on 2021-08-10 at 23:34:
>
>>>> (the i386 machine where i originally tested this preferred int
>>>> cmp and float cmp was very slow in the subnormal range
>>>
>>> This also and still holds for i386 FPU fadd/fmul as well as SSE
>>> addsd/addss/mulss/mulsd additions/multiplies!
>>
>> they are avoided in the common case, and only used to create
>> fenv side-effects.
>
> Unfortunately but for hard & SOFT-float, where no fenv exists, as
> Rich wrote.

My admittedly rudementary understanding of how soft-float is implemented 
in musl leads me to believe that this doesn't really matter that much.

>>> --- -/src/math/nextafter.c
>>> +++ +/src/math/nextafter.c
>>> @@ -10,13 +10,13 @@
>>>                 return x + y;
>>>         if (ux.i == uy.i)
>>>                 return y;
>>> -       ax = ux.i & -1ULL/2;
>>> -       ay = uy.i & -1ULL/2;
>>> +       ax = ux.i << 2;
>>> +       ay = uy.i << 2;
>>
>> the << 2 looks wrong, the top bit of the exponent is lost.
>
> It IS wrong, but only in the post, not in the code I tested.

So... in other words, you are testing code that is different than the code 
you are submitting?

>
>>>         if (ax == 0) {
>>>                 if (ay == 0)
>>>                         return y;
>>>                 ux.i = (uy.i & 1ULL<<63) | 1;
>>> -       } else if (ax > ay || ((ux.i ^ uy.i) & 1ULL<<63))
>>> +       } else if ((ax < ay) == ((int64_t) ux.i < 0))
>>>                 ux.i--;
>>>         else
>>>                 ux.i++;
>> ...
>>> How do you compare these 60 instructions/252 bytes to the code I posted
>>> (23 instructions/72 bytes)?
>>
>> you should benchmark, but the second best is to look
>> at the longest dependency chain in the hot path and
>> add up the instruction latencies.
>
> 1 billion calls to nextafter(), with random from, and to either 0 or +INF:
> run 1 against glibc,                         8.58 ns/call
> run 2 against musl original,                 3.59
> run 3 against musl patched,                  0.52
> run 4 the pure floating-point variant from   0.72
>      my initial post in this thread,
> run 5 the assembly variant I posted.         0.28 ns/call
>
> Now hurry up and patch your slowmotion code!

And how do these benchmarks look on non-x86 architectures, like aarch64 or 
riscv64?

I would rather have a portable math library with functions that cost 3.59 
nsec per call than one where the portable bits are not exercised on x86.

> Stefan
>
> PS: I cheated a very tiny little bit: the isnan() macro of musl patched is
>
> #ifdef PATCH
> #define isnan(x) ( \
> sizeof(x) == sizeof(float) ? (__FLOAT_BITS(x) << 1) > 0xff00000U : \
> sizeof(x) == sizeof(double) ? (__DOUBLE_BITS(x) << 1) > 0xffe0000000000000ULL : \
> __fpclassifyl(x) == FP_NAN)
> #else
> #define isnan(x) ( \
> sizeof(x) == sizeof(float) ? (__FLOAT_BITS(x) & 0x7fffffff) > 0x7f800000 : \
> sizeof(x) == sizeof(double) ? (__DOUBLE_BITS(x) & -1ULL>>1) > 0x7ffULL<<52 : \
> __fpclassifyl(x) == FP_NAN)
> #endif // PATCH
>
> PPS: and of course the log from the benchmarks...
>
> [stefan@...e ~]$ lscpu
> Architecture:          x86_64
> CPU op-mode(s):        32-bit, 64-bit
> Byte Order:            Little Endian
> CPU(s):                16
> On-line CPU(s) list:   0-15
> Thread(s) per core:    2
> Core(s) per socket:    8
> Socket(s):             1
> NUMA node(s):          1
> Vendor ID:             AuthenticAMD
> CPU family:            23
> Model:                 49
> Model name:            AMD EPYC 7262 8-Core Processor
> Stepping:              0
> CPU MHz:               3194.323
> BogoMIPS:              6388.64
> Virtualization:        AMD-V
> L1d cache:             32K
> L1i cache:             32K
> L2 cache:              512K
> L3 cache:              16384K
> ...
> [stefan@...e ~]$ gcc --version
> gcc (GCC) 8.3.1 20190311 (Red Hat 8.3.1-3)
> Copyright (C) 2018 Free Software Foundation, Inc.
> This is free software; see the source for copying conditions.  There is NO
> warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

gcc 8 is quite old at this point.  gcc 9 and 10 have much better 
optimizers that are much more capable.

Indeed, on my system with GCC 10.3.1, nextafter() is using SSE2 
instructions on Alpine x86_64, and if I rebuild musl with `-march=znver2` 
it uses AVX instructions for nextafter(), which seems more than 
sufficiently optimized to me.

Speaking in personal capacity only, I would rather musl's math routines 
remain to the point instead of going down the rabbit hole of manually 
optimized routines like glibc has done.  That also includes 
hand-optimizing the C routines to exploit optimal behavior for some 
specific microarch.  GCC already has knowledge of what optimizations are 
good for a specific microarch (this is the whole point of `-march` and 
`-mtune` after all), if something is missing, it should be fixed there.

Ariadne
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.