
Date: Mon, 9 Mar 2015 18:36:36 +0530
From: Sayantan Datta <std2048@...il.com>
To: john-dev <john-dev@...ts.openwall.com>
Subject: Re: 256/128 bit integer arithmatic

Hi

On Mon, Mar 9, 2015 at 3:57 PM, Solar Designer <solar@...nwall.com> wrote:

> On Mon, Mar 09, 2015 at 02:24:49PM +0530, Sayantan Datta wrote:
> > I have my own emulation based on 64bit uint in HI, LO configuration for
> > performing modulo operations specific to my requirements and it's much
> > faster than native 128bit modulo operations.
>
> What do you mean by native here?  gcc's __uint128_t?
>

I tested against unsigned __int128. Is it any different from __uint128_t
from a performance or portability perspective?

> > I tried to do the three 64bit
> > modulo operations required for emulating 128bit modulo using avx
> > intrinsics
>
> Why use SIMD for this?  Did you want to compute multiple instances of
> the modulo division in parallel (and does your task have the parallelism
> for this to make sense)?  Anyway, there's no SIMD integer division on x86.
>

If we're talking about the overall parallelism of the task, then I think
there is enough parallelism to start with, but it diminishes as the task
progresses. Initially, it is possible to do 8 or more '128-bit (depending
on hash type) modulo 32-bit' divisions in parallel. This progressively
diminishes to 2 modulo divisions in parallel, at which point the task is
almost 80-95% complete. The remaining 5-15% is sequential, but it's very
fast thanks to the algorithm itself.

If we're talking about individual modulo operations, then one '128-bit
modulo 32-bit' division requires three 64-bit modulo divisions (actually 4,
but in my case the 32-bit number is fixed, so it can be optimized), two of
which can run in parallel.

> > but it turns out there are no integer division built_ins in gcc, let
> > alone modulo operations. No wonder modulo operations are slow!!
>
> I'm not sure what you're referring to, and why you're blaming gcc.  This
> shows that gcc is able to produce widening integer multiply instructions
> (and I guess the corresponding division instructions as well, although
> that would need to be tested separately) without needing an intrinsic:
>

I searched Intel's Intrinsics Guide:

https://software.intel.com/sites/landingpage/IntrinsicsGuide/

It lists intrinsics for integer modulo and ordinary division; search for
the keywords 'rem' and 'div'. But I'm skeptical about their performance:
they have quite high latency, and unless we queue up quite a few
instructions, the latencies would be exposed. I'd like to test them anyway!

> http://stackoverflow.com/questions/13187629/gcc-intrinsic-for-extended-division-multiplication/13187798#13187798
>
> Also relevant:
>
> http://stackoverflow.com/questions/16822757/sse-integer-division/16830506#16830506
> http://libdivide.com
>
> I think we shouldn't depend on external libraries that we can easily
> avoid, though.
>
> Sayantan, you haven't mentioned what you're trying to accomplish, but I
> think you should redefine the task.  For example, you probably don't
> actually need 128bit precision to divide a keyspace.
>
> Alexander

The task is to build a hash table that requires exactly 2 memory accesses
per lookup. Lookups are cheap: each one takes exactly 2 memory accesses, so
there is no control divergence, which makes it a good fit for GPUs. The
space required for the hash table is also O(n). But building it is quite
intensive: building a hash table for 10M hashes takes 30 sec, while 100M
might take 10 minutes. A significant amount of that time is spent on modulo
operations, which cannot be replaced with simple & operations.

Regards,
Sayantan
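
For illustration, the decomposition described above (three 64-bit modulo
operations, two of them independent, plus a fourth that can be precomputed
when the 32-bit modulus is fixed) could look roughly like the sketch below.
This is an assumption about the approach, not the actual code discussed in
the thread: the type u128_emul and the functions pow2_64_mod and mod128_32
are hypothetical names. It relies on the identity
(hi * 2^64 + lo) mod m = ((hi mod m) * (2^64 mod m) + (lo mod m)) mod m.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical HI/LO representation of an unsigned 128-bit value. */
typedef struct {
    uint64_t hi;
    uint64_t lo;
} u128_emul;

/*
 * 2^64 mod m using only 64-bit arithmetic.
 * UINT64_MAX is 2^64 - 1, so reduce that and add the missing 1.
 * With a fixed modulus this is computed once and reused.
 */
static uint64_t pow2_64_mod(uint32_t m)
{
    assert(m != 0);
    return ((UINT64_MAX % m) + 1) % m;
}

/*
 * (hi * 2^64 + lo) mod m, where c = pow2_64_mod(m) is precomputed.
 * The first two reductions do not depend on each other.
 */
static uint32_t mod128_32(u128_emul x, uint32_t m, uint64_t c)
{
    assert(m != 0);
    uint64_t rh = x.hi % m;
    uint64_t rl = x.lo % m;
    /*
     * rh, c and rl are all at most m - 1 < 2^32, so
     * rh * c + rl <= (m - 1) * m < 2^64: no 64-bit overflow.
     */
    return (uint32_t)((rh * c + rl) % m);
}
```

With gcc or clang, the result can be checked against
(((unsigned __int128)x.hi << 64) | x.lo) % m for the same inputs. Whether
this is actually faster than the compiler's native 128-bit modulo would
need to be measured on the target hardware.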