```Date: Wed, 24 Jun 2015 01:13:12 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Re: [PATCH 2/5] dynlink.c: compute modulus via magic
multiplication

On Wed, Jun 24, 2015 at 07:32:56AM +0300, Alexander Monakov wrote:
> On Wed, 24 Jun 2015, Rich Felker wrote:
>
> > On Wed, Jun 24, 2015 at 02:24:52AM +0300, Alexander Monakov wrote:
> > > +static uint32_t umod(uint32_t x, uint32_t div, struct udiv *p)
> > > +{
> > > +	if (!(div&(div-1)))
> > > +		return x&(div-1);
> > > +	uint32_t v = x;
> > > +	if (p->s1) v >>= p->s1;
> > > +	else if (v != ~0u) v += p->inc;
> > > +	int s32=32, s2=p->s2;
> > > +	if (sizeof(long) == 8)
> > > +		s32+=s2, s2=0;
> > > +	v = (1ull * v * p->mul) >> s32;
> > > +	v >>= s2;
> > > +	return x-v*div;
> > > +}
> >
> > I think having the div argument here is a pessimization.
>
> How can I do the last step, 'x-v*div' without it?

Ah yes. Would it be preferable to have that in struct udiv then,
though? Then the caller never has to care about the divisor, and in
the case where umod doesn't get inlined, the call will be more
efficient (fewer args).

> > Power-of-two sizes are unlikely to be used,
>
> probably so, but...
>
> > and it's possible for precompute_udiv
> > to setup the struct udiv to work perfectly fine for powers of two,
>
> I gave it a thought, and I didn't find how to do that either.

Originally I thought this would work:

p->s1 = ctz(div);
p->inc = 0;
p->mul = 1;
p->s2 = 0;

But the >>32 of course breaks it. What about:

p->s1 = 0
p->inc = 0;
p->mul = 0x80000000;
p->s2 = ctz(div)-1;

This should handle all values except 1, which probably isn't needed,
but maybe there's a special case that works for 1 too...? I think the
following works for all values except 0xffffffff:

p->s1 = 0
p->inc = 1;
p->mul = 0xffffffff;
p->s2 = 0;

Using the post-mul add rather than the saturated increment would make
this work for 0xffffffff too.

Another obvious solution is not using the +32 offset so that the right
shift can just be 31, but that pessimizes the code on 32-bit archs
quite a bit.

> > so that there is a single code path with no branches. The p->s1 branch
> > should not be needed either.
>
> p->s1 check is for this reason: shift are relatively costly, and s1 is rarely
> non-zero, so try to skip the shift if possible; in the rare case the pre-shift
> is non-zero, the check allows to skip the saturating increment operation.

Shifts are costly? Are we talking about P4-era junk? ;-)

Rich
```