Date: Fri, 13 Aug 2021 11:59:17 -0400
From: Rich Felker <dalias@...c.org>
To: Stefan Kanthak <stefan.kanthak@...go.de>
Cc: Szabolcs Nagy <nsz@...t70.net>, musl@...ts.openwall.com
Subject: Re: [PATCH #2] Properly simplified nextafter()

On Fri, Aug 13, 2021 at 02:04:51PM +0200, Stefan Kanthak wrote:
> Szabolcs Nagy <nsz@...t70.net> wrote on 2021-08-10 at 23:34:
> 
> >* Stefan Kanthak <stefan.kanthak@...go.de> [2021-08-10 08:23:46 +0200]:
> >> <https://git.musl-libc.org/cgit/musl/plain/src/math/nextafter.c>
> >> has quite some superfluous statements:
> >> 
> >> 1. there's absolutely no need for 2 uint64_t holding |x| and |y|;
> >> 2. IEEE-754 specifies -0.0 == +0.0, so (x == y) is equivalent to
> >>    (ax == 0) && (ay == 0): the latter 2 tests can be removed;
> > 
> > you replaced 4 int cmps with 4 float cmps (among other things).
> > 
> > it's target dependent if float compares are fast or not.
> 
> It's also target dependent whether the FP additions and multiplies
> used to raise overflow/underflow are SLOOOWWW: how can you justify
> them, especially for targets using soft-float?

On a target with fenv, raising the exception flags is mandatory and
something must be done to make that happen. Performing the arithmetic
is far faster than prodding at the status flag registers explicitly.

On a target without fenv (all the existing soft float targets), it
should be valid for the compiler just not to emit them, since they
have no side effects. I think we could make FORCE_EVAL expand to
nothing on such targets (if the FE_* macros are not defined). This is
probably worth doing.

(Note that FORCE_EVAL only exists because compilers are buggy and
don't respect the fact that floating point expressions have side
effects even if the result is not used; with a properly working
compiler, it would not be needed, and non-fenv targets would
automatically omit the no-side-effect code.)

> |        /* raise overflow if ux.f is infinite and x is finite */
> |        if (e == 0x7ff)
> |                FORCE_EVAL(x+x);
> |        /* raise underflow if ux.f is subnormal or zero */
> |        if (e == 0)
> |                FORCE_EVAL(x*x + ux.f*ux.f);
> 
> > (the i386 machine where i originally tested this preferred int
> > cmp and float cmp was very slow in the subnormal range
> 
> This also and still holds for i386 FPU fadd/fmul as well as SSE
> addsd/addss/mulss/mulsd additions/multiplies!

It may be possible to reduce the number of such ops too; not sure. But
there's no way to eliminate them.

> Second version:
> 
> --- -/src/math/nextafter.c
> +++ +/src/math/nextafter.c
> @@ -10,13 +10,13 @@
>                 return x + y;
>         if (ux.i == uy.i)
>                 return y;
> -       ax = ux.i & -1ULL/2;
> -       ay = uy.i & -1ULL/2;
> +       ax = ux.i << 2;
> +       ay = uy.i << 2;
>         if (ax == 0) {
>                 if (ay == 0)
>                         return y;
>                 ux.i = (uy.i & 1ULL<<63) | 1;
> -       } else if (ax > ay || ((ux.i ^ uy.i) & 1ULL<<63))
> +       } else if ((ax < ay) == ((int64_t) ux.i < 0))
>                 ux.i--;
>         else
>                 ux.i++;
> 
> For AMD64, GCC generates the following ABSOLUTELY HORRIBLE CRAP
> (the original code compiles even worse):
> 
> 0000000000000000 <nextafter>:
>    0:   48 83 ec 38             sub    $0x38,%rsp
>    4:   0f 29 74 24 20          movaps %xmm6,0x20(%rsp)
>    9:   49 b8 ff ff ff ff ff    movabs $0x7fffffffffffffff,%r8
>   10:   ff ff 7f
>   13:   49 b9 00 00 00 00 00    movabs $0x7ff0000000000000,%r9
>   1a:   00 f0 7f
>   1d:   66 49 0f 7e c2          movq   %xmm0,%r10
>   22:   66 48 0f 7e c2          movq   %xmm0,%rdx
>   27:   66 48 0f 7e c8          movq   %xmm1,%rax
>   2c:   4d 21 c2                and    %r8,%r10
>   2f:   66 48 0f 7e c1          movq   %xmm0,%rcx
>   34:   4d 39 ca                cmp    %r9,%r10
>   37:   0f 87 83 00 00 00       ja     bb <nextafter+0xbb>
>   3d:   49 21 c0                and    %rax,%r8
>   40:   66 49 0f 7e ca          movq   %xmm1,%r10
>   45:   4d 39 c8                cmp    %r9,%r8
>   48:   77 76                   ja     bb <nextafter+0xbb>
>   4a:   66 0f 28 f1             movapd %xmm1,%xmm6
>   4e:   48 39 d0                cmp    %rdx,%rax
>   51:   74 7b                   je     c9 <nextafter+0xc9>
>   53:   66 49 0f 7e c0          movq   %xmm0,%r8
>   58:   48 8d 04 85 00 00 00    lea    0x0(,%rax,4),%rax
>   5f:   00
>   60:   49 c1 e0 02             shl    $0x2,%r8
>   64:   74 7a                   je     db <nextafter+0xd7>
>   66:   49 39 c0                cmp    %rax,%r8
>   69:   66 49 0f 7e c0          movq   %xmm0,%r8
>   6e:   48 8d 42 ff             lea    -0x1(%rdx),%rax
>   72:   41 0f 93 c1             setae  %r9b
>   76:   49 c1 e8 3f             shr    $0x3f,%r8
>   7a:   48 83 c1 01             add    $0x1,%rcx
>   7e:   45 38 c1                cmp    %r8b,%r9b
>   81:   48 0f 44 c1             cmove  %rcx,%rax
>   85:   48 89 c1                mov    %rax,%rcx
>   88:   66 48 0f 6e f0          movq   %rax,%xmm6
>   8d:   48 c1 e9 34             shr    $0x34,%rcx
>   91:   81 e1 ff 07 00 00       and    $0x7ff,%ecx
>   97:   81 f9 ff 07 00 00       cmp    $0x7ff,%ecx
>   9d:   74 61                   je     ef <nextafter+0xef>
>   9f:   85 c9                   test   %ecx,%ecx
>   a1:   75 2b                   jne    c9 <nextafter+0xc9>
>   a3:   66 48 0f 6e c2          movq   %rdx,%xmm0
>   a8:   66 48 0f 6e c8          movq   %rax,%xmm1
>   ad:   f2 0f 59 ce             mulsd  %xmm6,%xmm1
>   b1:   f2 0f 59 c0             mulsd  %xmm0,%xmm0
>   b5:   f2 0f 58 c1             addsd  %xmm1,%xmm0
>   b9:   eb 0e                   jmp    c9 <nextafter+0xc9>
>   bb:   66 48 0f 6e f2          movq   %rdx,%xmm6
>   c0:   66 48 0f 6e d0          movq   %rax,%xmm2
>   c5:   f2 0f 58 f2             addsd  %xmm2,%xmm6
>   c9:   66 0f 28 c6             movapd %xmm6,%xmm0
>   cd:   0f 28 74 24 20          movaps 0x20(%rsp),%xmm6
>   d2:   48 83 c4 38             add    $0x38,%rsp
>   d6:   c3                      retq
>   d7:   48 85 c0                test   %rax,%rax
>   da:   74 e9                   je     c9 <nextafter+0xc9>
>   dc:   48 b8 00 00 00 00 00    movabs $0x8000000000000000,%rax
>   e3:   00 00 80
>   e6:   4c 21 d0                and    %r10,%rax
>   ea:   48 83 c8 01             or     $0x1,%rax
>   ed:   eb 8d                   jmp    85 <nextafter+0x85>
>   ef:   66 48 0f 6e c2          movq   %rdx,%xmm0
>   f4:   f2 0f 58 c0             addsd  %xmm0,%xmm0
>   f8:   eb be                   jmp    c9 <nextafter+0xc9>
> 
> How do you compare these 60 instructions/252 bytes to the code I posted
> (23 instructions/72 bytes)?
> 
> not amused about such HORRIBLE machine code!

Then submit patches to GCC. We are not writing a libc in assembly
language just because of a compiler that can't generate sufficiently
good code.

Have you traced the different execution paths through the above to see
if the cases that matter (non-exceptional ones) are actually terribly
bad? I haven't trawled through it to do this, but that's the first
thing I'd do if I wanted to evaluate it. From there it's easier to see
what's actually going wrong (if anything) and use that to reason about
whether there are reasonable changes to be made to the C code that
would produce better output.

Rich
