Date: Wed, 11 Aug 2021 17:44:28 +0200
From: "Stefan Kanthak" <stefan.kanthak@...go.de>
To: "Rich Felker" <dalias@...c.org>
Cc: "Szabolcs Nagy" <nsz@...t70.net>, <musl@...ts.openwall.com>
Subject: Re: [PATCH] Properly simplified nextafter()

Rich Felker <dalias@...c.org> wrote:

> On Wed, Aug 11, 2021 at 12:53:37AM +0200, Stefan Kanthak wrote:
>> Szabolcs Nagy <nsz@...t70.net> wrote:
>>
>>>* Stefan Kanthak <stefan.kanthak@...go.de> [2021-08-10 08:23:46 +0200]:
>>>> <https://git.musl-libc.org/cgit/musl/plain/src/math/nextafter.c>
>>>> has quite some superfluous statements:
>>>>
>>>> 1. there's absolutely no need for 2 uint64_t holding |x| and |y|;
>>>> 2. IEEE-754 specifies -0.0 == +0.0, so (x == y) is equivalent to
>>>>    (ax == 0) && (ay == 0): the latter 2 tests can be removed;
>>>
>>> you replaced 4 int cmps with 4 float cmps (among other things).
>>
>> and hinted that the result of the second pair of comparisons is
>> already known from the first pair.
>>
>>> it's target dependent if float compares are fast or not.
>>
>> It's also target dependent whether the floating-point registers
>> can be accessed by integer instructions, or need to be copied:
>> some win, some lose!
>> Just let the compiler/optimizer do its job!
>
> The values have been copied already to perform isnan,

NOT necessary: the compiler may have inlined isnan() and performed the
test, for example using FXAM, FUCOM or FUCOMI on i386, or UCOMISD on
AMD64, without copying the arguments.
I recommend inspecting the code GCC generates for AMD64, for example.

> so continuing to access them does not incur any further cost.

Non sequitur: see above.

[...]

>> 0. Doesn't musl provide target specific routines for targets with
>>    soft FP?
>
> No, quite the opposite. Targets with hard fp and native insns for
> particular ops have target-specific versions,

That's why I assumed that this may also be the case for soft FP.

> but in general musl strongly prefers use of common implementation
> across all targets when there is not an obvious [nearly-]single-insn
> candidate for a specialized version.

That's one of the reasons why I submitted this patch: FP hardware is
mainstream.

>> 1. If not: the compiler knows the target ABI and SHOULD generate
>>    the proper integer comparisons there.
>
> Here it would require the compiler to recognize that the nan case was
> already ruled out, and to special-case ±0 comparison on the
> representation. Of course this is possible in theory, but it's almost
> surely not happening now or any time soon. I'm pretty sure soft float
> targets just end up calling the libgcc function for floating point
> comparison if you do that.

| if (isnan(x) || isnan(y))
|     return x + y;

The 4 instructions I mentioned above set flags for all cases: see below.

>> The code is of course smaller ... but not as small and fast as a
>> proper i386 or AMD64 assembly implementation ... which I can
>> post upon request.
>
> Full asm functions are not wanted; it's something we're trying to get
> rid of in favor of just using very small/single-insn asm statements
> with proper constraints, where it's sufficiently beneficial to have
> asm at all. But I'm not even clear how you could make this function
> more efficient with asm. The overall logic would be exactly the same
> as the C. Maybe on x86_64 there'd be some SSE instructions to let you
> elide a few things?

No, just what the instruction set offers: 23 instructions in 72 bytes.

nextafter:
    comisd   xmm1, xmm0     # CF = (from > to)
    jp       .Lmxcsr        # from or to INDEFINITE?
    je       .Lequal        # from = to?
    sbb      rdx, rdx       # rdx = (from > to) ? -1 : 0
    movq     rcx, xmm0      # rcx = from
    mov      rax, rcx
    add      rax, rax       # CF = (from & -0.0)
    jz       .Lzero         # from = ±0.0?
.Lstep:
    sbb      rax, rax       # rax = (from < 0.0) ? -1 : 0
    xor      rax, rdx       # rax = (from < 0.0) ^ (from > to) ? -1 : 0
    or       rax, 1         # rax = (from < 0.0) ^ (from > to) ? -1 : 1
    add      rax, rcx       # rax = nextafter(from, to)
    movq     xmm0, rax      # xmm0 = nextafter(from, to)
    xorpd    xmm1, xmm1
.Lmxcsr:
    addsd    xmm0, xmm1     # set MXCSR flags
    ret
.Lequal:
    movsd    xmm0, xmm1     # xmm0 = to
    ret
.Lzero:
    movmskpd eax, xmm1      # rax = (to & -0.0) ? 0b?1 : 0b?0
    or       eax, 2         # rax = (to & -0.0) ? 0b11 : 0b10
    ror      rax, 1         # rax = (to & -0.0) ? 0x8000000000000001 : 1
    movq     xmm0, rax      # xmm0 = (to & -0.0) ? -0x1.0p-1074 : 0x1.0p-1074
    ret

GCC generates at least 12 more instructions here, some of them longer,
including 2 movabs to load 0x8000000000000000 and 0x7FFFFFFFFFFFFFFF,
so its code is more than 50% fatter. It also mixes integer SSE and FP
SSE instructions, which incurs a 2-cycle penalty on many Intel CPUs,
and has WAY TOO MANY not-so-predictable (un)conditional branches.

JFTR: it's almost always easy to beat the compiler!

Stefan
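
For readers who find C easier to follow, the control flow of the
assembly listing above can be sketched roughly as below. This is only
an illustration under stated assumptions: the name nextafter_sketch is
hypothetical, double is assumed to be IEEE-754 binary64, and the
MXCSR/exception-flag handling done by the final addsd is left out. It
is neither the musl source nor the submitted patch.

```c
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper mirroring the branches of the assembly listing;
 * assumes IEEE-754 binary64 and does not raise any FP exception flags. */
static double nextafter_sketch(double from, double to)
{
    uint64_t bits;

    if (isnan(from) || isnan(to))
        return from + to;               /* .Lmxcsr: propagate the NaN */
    if (from == to)
        return to;                      /* .Lequal: also covers -0.0 == +0.0 */

    memcpy(&bits, &from, sizeof bits);  /* movq rcx, xmm0 */
    if ((bits << 1) == 0) {
        /* .Lzero: from is +-0.0, so the result is the smallest
         * subnormal carrying the sign of to */
        memcpy(&bits, &to, sizeof bits);
        bits = (bits & UINT64_C(0x8000000000000000)) | 1;
    } else if ((from < 0.0) != (from > to)) {
        bits--;                         /* .Lstep: move toward zero */
    } else {
        bits++;                         /* .Lstep: move away from zero */
    }
    memcpy(&from, &bits, sizeof from);
    return from;
}
```

The single test (from < 0.0) != (from > to) corresponds to the
sbb/xor/or sequence: the magnitude of the representation shrinks
exactly when the sign of from and the direction of the step disagree.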