
Date: Mon, 24 Apr 2017 00:34:48 +0200
From: Szabolcs Nagy <nsz@...t70.net>
To: musl@...ts.openwall.com
Subject: Re: [PATCH] math: rewrite fma with mostly int arithmetics

* Rich Felker <dalias@...c.org> [2017-04-23 11:15:39 -0400]:
> On Sun, Apr 23, 2017 at 01:00:52PM +0200, Szabolcs Nagy wrote:
> > * Rich Felker <dalias@...c.org> [2017-04-22 18:24:25 -0400]:
> > > Is it difficult to determine when the multiplication part of an fma is
> > > exact? If you can determine this quickly, you can just return x*y+z in
> > > this special case and avoid all the costly operations. For normal
> > > range, I think it's roughly just using ctz to count mantissa bits of x
> > > and y, and checking whether the sum is <= 53. Some additional handling
> > > for denormals is needed of course.
> >
> > it is a bit more difficult than that:
> >
> > bits(a) + bits(b) < 54 || (bits(a) + bits(b) == 54 && a*b < 2)
> >
> > this is probably possible to handle when i do the int mul.
> >
> > however the rounding mode special cases don't get simpler
> > and inexact flag still may be raised incorrectly when tail
> > bits of x*y beyond 53 bits are eliminated when z is added
> > (the result is exact but the dekker algorithm raises inexact).
>
> One thing to note: even if it's not a replacement for the whole
> algorithm, this seems like a very useful optimization for a case
> that's easy to test. "return x*y+z;" is going to be a lot faster than
> anything else you can do. But maybe it's rare to hit cases where the
> optimization works; it certainly "should" be rare if people are using
> fma for the semantics rather than as a misguided optimization.

i didn't see a simple way to check for exact x*y result
(if it were easy then that could capture the exact 0 result
case which means one less special case later, but this is
not easy if x*y is in the subnormal range or overflows)

> > > If the only constraint here is that top 10 bits and last bit are 0, I
> > > don't see why clz is even needed. You can meet this constraint for
> > > denormals by always multiplying by 2 and using a fixed exponent value.
> >
> > yeah that should work, but i also use clz later
>
> Ah, I missed that. Still it might be a worthwhile optimization here; I
> think it shaves off a few ops in normalize().

attached a new version with updated normalize.

on my laptop latency and code size:

old x86_64:  67 ns/call   893 bytes
new x86_64:  20 ns/call   960 bytes
old i386:    80 ns/call   942 bytes
new i386:    75 ns/call  1871 bytes
old arm:      -           960 bytes
new arm:      -          1200 bytes
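
For readers following along, here is a rough sketch of the exactness test quoted above (significant bits of x plus significant bits of y below 54, or equal to 54 with a significand product below 2). It is not taken from the patch. The names asuint64, sig_bits and mul_is_exact are made up for illustration, __builtin_ctzll assumes GCC or Clang, and the function deliberately answers 0 for zero, subnormal, inf and nan inputs and whenever the product could leave the normal range, since those are the cases the thread calls hard.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: reinterpret a double as its IEEE 754 bit pattern. */
static uint64_t asuint64(double x)
{
	uint64_t u;
	memcpy(&u, &x, sizeof u);
	return u;
}

/* Significant bits of a 53-bit significand m (implicit bit set, m != 0).
   __builtin_ctzll is a GCC/Clang builtin, assumed available here. */
static int sig_bits(uint64_t m)
{
	return 53 - __builtin_ctzll(m);
}

/* Returns 1 only when x*y is known to be exact under the assumptions
   above; 0 means "unknown or inexact", never "definitely inexact". */
int mul_is_exact(double x, double y)
{
	uint64_t ix = asuint64(x), iy = asuint64(y);
	int ex = ix >> 52 & 0x7ff;
	int ey = iy >> 52 & 0x7ff;

	/* zero, subnormal, inf, nan: give up (conservative) */
	if (ex == 0 || ex == 0x7ff || ey == 0 || ey == 0x7ff)
		return 0;

	/* biased result exponent is e or e+1; require both to be normal */
	int e = ex + ey - 0x3ff;
	if (e < 1 || e + 1 > 0x7fe)
		return 0;

	uint64_t mx = (ix & ((1ull << 52) - 1)) | 1ull << 52;
	uint64_t my = (iy & ((1ull << 52) - 1)) | 1ull << 52;
	int bx = sig_bits(mx), by = sig_bits(my);

	if (bx + by < 54)
		return 1;
	if (bx + by > 54)
		return 0;

	/* bx + by == 54: strip trailing zeros so a and b are odd integers.
	   a*b < 2^54 fits in 64 bits; the significand product is below 2
	   exactly when a*b needs only 53 bits. */
	uint64_t a = mx >> (53 - bx);
	uint64_t b = my >> (53 - by);
	return a * b < 1ull << 53;
}
```

A caller could then try something like `if (mul_is_exact(x, y)) return x*y + z;` before the slow path. Whether that branch pays for itself depends on how often real inputs hit it, which is the open question raised in the thread.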
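
The clz-free idea for denormals discussed above can be sketched roughly as follows. This is not the attached normalize(): struct num and decompose are hypothetical names, and the layout (integer significand m with the top 10 bits and the last bit clear, plus a power-of-two exponent e) is only inferred from the constraint mentioned in the thread. Subnormals and zero reuse the fixed exponent of the smallest normal binade and get no implicit bit; inf and nan are left to the caller.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical decomposition: finite x == (-1)^sign * m * 2^e */
struct num {
	uint64_t m;
	int e;
	int sign;
};

struct num decompose(double x)
{
	uint64_t ix;
	struct num n;
	int be;

	memcpy(&ix, &x, sizeof ix);
	n.sign = ix >> 63;
	be = ix >> 52 & 0x7ff;           /* biased exponent field */
	n.m = ix & ((1ull << 52) - 1);   /* 52 fraction bits */

	if (be)
		n.m |= 1ull << 52;       /* normal: add the implicit bit */
	else
		be = 1;                  /* subnormal or zero: fixed exponent */

	n.m <<= 1;                       /* "multiply by 2": last bit is 0 */
	n.e = be - 0x3ff - 52 - 1;       /* m < 2^54, so top 10 bits are 0 */
	return n;
}
```

Note that for subnormal inputs the leading set bit of m is not at a fixed position, so later steps that need a fully normalized significand would still want clz, which matches the reply in the thread.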