Date: Wed, 11 Aug 2021 12:09:39 -0400
From: Rich Felker <dalias@...c.org>
To: Stefan Kanthak <stefan.kanthak@...go.de>
Cc: Szabolcs Nagy <nsz@...t70.net>, musl@...ts.openwall.com
Subject: Re: [PATCH] Properly simplified nextafter()

On Wed, Aug 11, 2021 at 05:44:28PM +0200, Stefan Kanthak wrote:
> Rich Felker <dalias@...c.org> wrote:
> 
> > On Wed, Aug 11, 2021 at 12:53:37AM +0200, Stefan Kanthak wrote:
> >> Szabolcs Nagy <nsz@...t70.net> wrote:
> >>
> >>>* Stefan Kanthak <stefan.kanthak@...go.de> [2021-08-10 08:23:46 +0200]:
> >>>> <https://git.musl-libc.org/cgit/musl/plain/src/math/nextafter.c>
> >>>> has quite a few superfluous statements:
> >>>>
> >>>> 1. there's absolutely no need for 2 uint64_t holding |x| and |y|;
> >>>> 2. IEEE-754 specifies -0.0 == +0.0, so a floating-point (x == y) test
> >>>>    already covers the (ax == 0) && (ay == 0) case: the latter 2 tests
> >>>>    can be removed;
> >>>
> >>> you replaced 4 int cmps with 4 float cmps (among other things).
> >>
> >> and hinted that the result of the second pair of comparisons is
> >> already known from the first pair.
> >>
> >>> it's target-dependent whether float compares are fast or not.
> >>
> >> It's also target dependent whether the floating-point registers
> >> can be accessed by integer instructions, or need to be copied:
> >> some win, some lose!
> >> Just let the compiler/optimizer do its job!
> >
> > The values have been copied already to perform isnan,
> 
> NOT necessary: the compiler may have inlined isnan() and performed
> the test, for example using FXAM, FUCOM or FUCOMI on i386, or
> UCOMISD on AMD64, without copying the arguments.

static __inline unsigned __FLOAT_BITS(float __f)
{
	union {float __f; unsigned __i;} __u;
	__u.__f = __f;
	return __u.__i;
}

#define isnan(x) ( \
	sizeof(x) == sizeof(float) ? (__FLOAT_BITS(x) & 0x7fffffff) > 0x7f800000 : \
	sizeof(x) == sizeof(double) ? (__DOUBLE_BITS(x) & -1ULL>>1) > 0x7ffULL<<52 : \
	__fpclassifyl(x) == FP_NAN)

So, nope. Unless it's doing some extremely high level rewriting of
this inspection of the representation.
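
(For reference, the structure of nextafter.c that this refers to is roughly
the following; a paraphrase for illustration, not a verbatim quote of the
file linked above. The point is that both arguments are punned into integers
up front, so the isnan() test and the integer comparisons that follow all
reuse the same copied bits.)

#include <math.h>
#include <stdint.h>

double nextafter_sketch(double x, double y)
{
	union {double f; uint64_t i;} ux = {x}, uy = {y};
	uint64_t ax, ay;

	if (isnan(x) || isnan(y))          /* bit inspection as shown above */
		return x + y;
	if (ux.i == uy.i)                  /* bit-for-bit equal */
		return y;
	ax = ux.i & -1ULL/2;               /* |x| as bits */
	ay = uy.i & -1ULL/2;               /* |y| as bits */
	if (ax == 0) {                     /* x is +0.0 or -0.0 */
		if (ay == 0)               /* opposite-signed zeros */
			return y;
		ux.i = (uy.i & 1ULL<<63) | 1;  /* smallest subnormal toward y */
	} else if (ax > ay || ((ux.i ^ uy.i) & 1ULL<<63))
		ux.i--;                    /* step toward y, magnitude shrinks */
	else
		ux.i++;                    /* step toward y, magnitude grows */
	/* (the real file additionally forces overflow/underflow exceptions) */
	return ux.f;
}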

> >> 0. Doesn't musl provide target specific routines for targets with
> >>    soft FP?
> >
> > No, quite the opposite. Targets with hard fp and native insns for
> > particular ops have target-specific versions,
> 
> That's why I assumed that this may also be the case for soft FP.

I don't see how that follows. Targets with soft fp are exactly the
opposite of having a specialized native way to perform the operation;
they do it in the most general way.

> >> The code is of course smaller ... but not as small and fast as a
> >> proper i386 or AMD64 assembly implementation ... which I can
> >> post upon request.
> >
> > Full asm functions are not wanted; it's something we're trying to get
> > rid of in favor of just using very small/single-insn asm statements
> > with proper constraints, where it's sufficiently beneficial to have
> > asm at all. But I'm not even clear how you could make this function
> > more efficient with asm. The overall logic would be exactly the same
> > as the C. Maybe on x86_64 there'd be some SSE instructions to let you
> > elide a few things?
> 
> No, just what the instruction set offers: 23 instructions in 72 bytes.
> 
> nextafter:
>         comisd  xmm1, xmm0              # CF = (from > to)
>         jp      .Lmxcsr                 # from or to INDEFINITE?
>         je      .Lequal                 # from = to?
>         sbb     rdx, rdx                # rdx = (from > to) ? -1 : 0
>         movq    rcx, xmm0               # rcx = from
>         mov     rax, rcx
>         add     rax, rax                # CF = (from & -0.0)
>         jz      .Lzero                  # from = ±0.0?
> .Lstep:
>         sbb     rax, rax                # rax = (from < 0.0) ? -1 : 0
>         xor     rax, rdx                # rax = (from < 0.0) ^ (from > to) ? -1 : 0
>         or      rax, 1                  # rax = (from < 0.0) ^ (from > to) ? -1 : 1
>         add     rax, rcx                # rax = nextafter(from, to)
>         movq    xmm0, rax               # xmm0 = nextafter(from, to)
>         xorpd   xmm1, xmm1
> .Lmxcsr:
>         addsd   xmm0, xmm1              # set MXCSR flags
>         ret
> .Lequal:
>         movsd   xmm0, xmm1              # xmm0 = to
>         ret
> .Lzero:
>         movmskpd eax, xmm1              # rax = (to & -0.0) ? 0b?1 : 0b?0
>         or      eax, 2                  # rax = (to & -0.0) ? 0b11 : 0b10
>         ror     rax, 1                  # rax = (to & -0.0) ? 0x8000000000000001 : 1
>         movq    xmm0, rax               # xmm0 = (to & -0.0) ? -0x1.0p-1074 : 0x1.0p-1074
>         ret
> 
> GCC generates here at least 12 more (and longer) instructions,
> including 2 movabs to load 0x8000000000000000 and 0x7FFFFFFFFFFFFFFF,
> so the code is more than 50% fatter, mixes integer SSE and FP SSE
> instructions (which incurs a 2-cycle penalty on many Intel CPUs), and
> has WAY TOO MANY not-so-predictable (un)conditional branches.
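
(For readers who don't speak x86, a rough C rendering of the branchless
.Lstep path in the listing above; illustrative only, with made-up names,
and assuming the preconditions of that path: from != to, neither argument
is NaN, and from is not +/-0.0.)

#include <stdint.h>
#include <string.h>

double step_once(double from, double to)
{
	uint64_t bits;
	memcpy(&bits, &from, sizeof bits);      /* bit pattern of from */

	int64_t dir  = -(int64_t)(from > to);   /* -1 if from > to, else 0  */
	int64_t sign = -(int64_t)(bits >> 63);  /* -1 if from < 0.0, else 0 */
	int64_t step = (sign ^ dir) | 1;        /* -1 or +1                 */

	bits += (uint64_t)step;                 /* move one ulp toward to   */
	memcpy(&from, &bits, sizeof from);
	return from;
}

(The equal, NaN, and zero cases are the separate .Lequal/.Lmxcsr/.Lzero
paths above.)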

We don't use asm to optimize out 2 cycles. If the compiler is choosing
a bad way to perform these loads, the compiler should be fixed. But I
don't think it matters in any measurable way in real usage.

Rich
