
Date: Mon, 9 Mar 2015 18:36:36 +0530
From: Sayantan Datta <std2048@...il.com>
To: johndev <johndev@...ts.openwall.com>
Subject: Re: 256/128 bit integer arithmatic
Hi
On Mon, Mar 9, 2015 at 3:57 PM, Solar Designer <solar@...nwall.com> wrote:
> On Mon, Mar 09, 2015 at 02:24:49PM +0530, Sayantan Datta wrote:
> > I have my own emulation based on 64bit uint in HI, LO configuration for
> > performing modulo operations specific to my requirements and it's much
> > faster than native 128bit modulo operations.
>
> What do you mean by native here? gcc's __uint128_t?
>
I tested against unsigned __int128. Are they any different from __uint128_t
from performance or portability perspective?
>
> > I tried to do the three 64bit
> > modulo operations required for emulating 128bit modulo using avx
> intrinsics
>
> Why use SIMD for this? Did you want to compute multiple instances of
> the modulo division in parallel (and does your task have the parallelism
> for this to make sense)? Anyway, there's no SIMD integer division on x86.
>
If we're talking about overall parallelism of the task, then I think there
is enough parallelism to start with, but it diminishes as the task
progresses. Initially, it is possible to do 8 or more '128(depending on
hash type) bit modulo 32 bit divisions' in parallel which progressively
diminishes to 2 modulo divisions in parallel, at which point the task
almost 80  95 % complete. The remaining 5  15 % is sequential but it's
very fast, thanks to the algorithm itself.
If we're talking about individual modulo operations, then doing '128bit
modulo 32bit division' requires three 64 bit modulo division(actually 4,
but in my case the 32bit number is fixed, so it can be optimized), among
which two can be parallel.
>
> > but it turns out there are no integer division built_ins in gcc, let
> alone
> > modulo operations. No wonder modulo operations are slow!!
>
> I'm not sure what you're referring to, and why you're blaming gcc. This
> shows that gcc is able to produce widening integer multiply instructions
> (and I guess the corresponding division instructions as well, although
> that would need to be tested separately) without needing an intrinsic:
>
Searching the Intel's intrinsic guide:
https://software.intel.com/sites/landingpage/IntrinsicsGuide/
It shows there are intrinsics for integer modulo and normal division.
Search for keywords : 'rem' and 'div'.
But I'm skeptical about their performance as they have quite high latency
and unless we queue up quite a few instructions, the latencies would be
exposed. But, I'd like to test them anyway!!
<https://software.intel.com/sites/landingpage/IntrinsicsGuide/>
>
>
> http://stackoverflow.com/questions/13187629/gccintrinsicforextendeddivisionmultiplication/13187798#13187798
>
> Also relevant:
>
>
> http://stackoverflow.com/questions/16822757/sseintegerdivision/16830506#16830506
> http://libdivide.com
>
> I think we shouldn't depend on external libraries that we can easily
> avoid, though.
>
> Sayantan, you haven't mentioned what you're trying to accomplish, but I
> think you should redefine the task. For example, you probably don't
> actually need 128bit precision to divide a keyspace.
>
> Alexander
The task is to build a hashtable which requires exactly 2 memory operations
per lookup. It is very easy to do a lookups that takes exactly 2 memory
accesses, hence no control divergence, hence a good fit for GPUs. Also the
space required for the hashtable is O(n). But building them is quite
intensive. Building a hash table for 10M hashes takes 30 sec while 100M
might take 10 minutes to build. Significant amount of the time is spent on
doing modulo operations, which cannot be done using simple & operators.
Regards,
Sayantan
Content of type "text/html" skipped
Powered by blists  more mailing lists