
Date: Sun, 23 Apr 2017 11:15:39 0400 From: Rich Felker <dalias@...c.org> To: musl@...ts.openwall.com Subject: Re: [PATCH] math: rewrite fma with mostly int arithmetics On Sun, Apr 23, 2017 at 01:00:52PM +0200, Szabolcs Nagy wrote: > * Rich Felker <dalias@...c.org> [20170422 18:24:25 0400]: > > A few thoughts, inline below. I'm not entirely opposed to this, if it > > turns out to be better than the alternatives, but I would like to > > understand whether it really is... > > > > On Wed, Apr 19, 2017 at 12:41:40AM +0200, Szabolcs Nagy wrote: > > > the freebsd fma code failed to raise underflow exception in some > > > cases in nearest rounding mode (affects fmal too) e.g. > > > > > > fma(0x1p1000, 0x1.000001p74, 0x1p1022) > > > > > > and the inexact exception may be raised spuriously since the fenv > > > is not saved/restored around the exact multiplication algorithm > > > (affects x86 fma too). > > > > Is it difficult to determine when the multiplication part of an fma is > > exact? If you can determine this quickly, you can just return x*y+z in > > this special case and avoid all the costly operations. For normal > > range, I think it's roughly just using ctz to count mantissa bits of x > > and y, and checking whether the sum is <= 53. Some additional handling > > for denormals is needed of course. > > it is a bit more difficult than that: > > bits(a) + bits(b) < 54  (bits(a) + bits(b) == 54 && a*b < 2) > > this is probably possible to handle when i do the int mul. > > however the rounding mode special cases don't get simpler > and inexact flag still may be raised incorrectly when tail > bits of x*y beyond 53 bits are eliminated when z is added > (the result is exact but the dekker algorithm raises inexact). One thing to note: even if it's not a replacement for the whole algorithm, this seems like a very useful optimization for a case that's easy to test. "return x*y+z;" is going to be a lot faster than anything else you can do. But maybe it's rare to hit cases where the optimization works; it certainly "should" be rare if people are using fma for the semantics rather than as a misguided optimization. > > > depends on the a_clz_64 patch and previous scalbn fix. > > > > > > fmal should be possible to do in a similar way. > > > > > > i expect it to be faster than the previous code on most targets as > > > the rounding mode is not changed and has less multiplications > > > (it is faster on x86_64 and i386), the code size is a bit bigger > > > though. > > > > Kinda surprising on i386  I'd expect the 64x64 multiplications to be > > costly compared to float ones. > > > > i implement 64x64 int mul by four 32x32>64 mul, > i386 has 32x32>64 mul op so that works out well. Most archs should have a 32x32>64; if not this is an ISA quality issue and not one I really want to focus on remedying. > the float code has to do dekker's exact multiplication > which uses six(!) fp mul (each of which is probably > slower than an int mul) and a lot more fp add. OK, this makes your approach make a lot more sense! Thanks for sharing that info. > > > /* normalize so top 10bits and last bit are 0 */ > > > struct num nx, ny, nz; > > > nx = normalize(ix); > > > ny = normalize(iy); > > > nz = normalize(iz); > > > > If the only constraint here is that top 10 bits and last bit are 0, I > > don't see why clz is even needed. You can meet this constraint for > > denormals by always multiplying by 2 and using a fixed exponent value. > > yeah that should work, but i also use clz later Ah, I missed that. Still it might be a worthwhile optimization here; I think it shaves off a few ops in normalize(). Rich
Powered by blists  more mailing lists
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.