Date: Thu, 31 Jan 2019 01:04:26 +0100
From: Szabolcs Nagy <nsz@...t70.net>
To: musl@...ts.openwall.com
Subject: Re: Possible Mistype in exp.c

* Damian McGuckin <damianm@....com.au> [2019-01-30 23:56:05 +1100]:
> As a matter of interest, what was the benchmark against which you get a 2x
> speed gain?

i benchmarked a tight loop around a call where
(1) calls are independent so can be evaluated in parallel
(2) calls depend on the previous iteration result.
(so (1) measures maximum call throughput and (2) measures latency)

usually (1) can be improved significantly over old math code.
and such usage (e.g. calling a single function over an array) is
common (e.g. in fortran code) so rewriting old code is useful.

> I got 1.75 against GLIBC, for what that is worth. I used a faster scaling
> routine. But I was not chasing improved ULP performance like you were, as
> that was too much extra work. Your work there sounds like seriously smart
> stuff to me.

glibc <= 2.26 is old fdlibm and ibm libultim code
glibc 2.27 has single prec improvements
glibc 2.28 has double prec correct rounding slow paths removed
glibc 2.29 will use my code

so glibc version matters when you benchmark.

i did most of my benchmarks on various 64bit arm cores.
(which e.g. always have fma and int rounding instructions,
so some things are designed differently than on x86, but
the isa does not matter too much, the cpu internals matter
more and those seem to be similar across targets)

> I used super-scalar friendly code which adds an extra multiplication. It
> made a minuscule net benefit on the Xeons (not Xeon Gold).
> 
> I had 2 versions of the fast scaling routine replacing ldexp. One used a
> single ternary if/then/else and the other grabbed the sign and did a table
> lookup, which meant one extra multiplication all the time but no branches.
> The one extra multiplication instead of a branch in the 2-line scaling
> routine made no difference.
> 
> I saw a tiny but measurable difference when I used a Xeon with an FMA
> compared to one which did not.
> 
> Discarding the last term in the SUN routine, a net loss of one
> multiplication, still made no serious difference to the timing, though of
> course the accuracy of the results was affected.
> 
> My timings showed on my reworked code (for doubles)
> 
> 	21+%	the preliminary comparisons
> 	43+%	polynomial computation super-scalar friendly way
> 	35+%	y = 1 + (x*c/(2-c) - lo + hi);
> 		return k == 0 ? y : scalbn-FAST(y, k);

here the killer is the division really.

scalbn(1+p, k) can be s=2^k; s + s*p; (single fma, just be
careful not to overflow s), but using a table lookup instead
of div should help the most.

> I have slightly increased the work load in the comparisons because I avoid
> pulling 'x' apart into 'hx'. I used only doubles or floats.
