
Date: Tue, 17 Mar 2015 08:45:15 +0300 From: Solar Designer <solar@...nwall.com> To: johndev@...ts.openwall.com Subject: Re: 256/128 bit integer arithmatic On Mon, Mar 09, 2015 at 06:36:36PM +0530, Sayantan Datta wrote: > On Mon, Mar 9, 2015 at 3:57 PM, Solar Designer <solar@...nwall.com> wrote: > > On Mon, Mar 09, 2015 at 02:24:49PM +0530, Sayantan Datta wrote: > > > I have my own emulation based on 64bit uint in HI, LO configuration for > > > performing modulo operations specific to my requirements and it's much > > > faster than native 128bit modulo operations. > > > > What do you mean by native here? gcc's __uint128_t? > > I tested against unsigned __int128. Are they any different from __uint128_t > from performance or portability perspective? I'd expect them to be the same performance. As to portability, I don't know. __uint128_t worked for me even on gcc 3.4.5. > > > I tried to do the three 64bit > > > modulo operations required for emulating 128bit modulo using avx > > intrinsics > > > > Why use SIMD for this? Did you want to compute multiple instances of > > the modulo division in parallel (and does your task have the parallelism > > for this to make sense)? Anyway, there's no SIMD integer division on x86. > > If we're talking about overall parallelism of the task, then I think there > is enough parallelism to start with, but it diminishes as the task > progresses. Initially, it is possible to do 8 or more '128(depending on > hash type) bit modulo 32 bit divisions' in parallel which progressively > diminishes to 2 modulo divisions in parallel, at which point the task > almost 80  95 % complete. The remaining 5  15 % is sequential but it's > very fast, thanks to the algorithm itself. > > If we're talking about individual modulo operations, then doing '128bit > modulo 32bit division' requires three 64 bit modulo division(actually 4, > but in my case the 32bit number is fixed, so it can be optimized), among > which two can be parallel. All of this sounds like a wrong way to use SIMD. You're trying to find parallelism within one instance of the task, instead of bringing multiple instances (e.g., multiple computation of perfect hash table indices) to instruction and SIMD level. See this presentation on wrong vs. right ways to use SIMD: https://deplinenoise.wordpress.com/2015/03/06/slidessimdatinsomniacgamesgdc2015/ > > > but it turns out there are no integer division built_ins in gcc, let > > alone > > > modulo operations. No wonder modulo operations are slow!! > > > > I'm not sure what you're referring to, and why you're blaming gcc. This > > shows that gcc is able to produce widening integer multiply instructions > > (and I guess the corresponding division instructions as well, although > > that would need to be tested separately) without needing an intrinsic: > > Searching the Intel's intrinsic guide: > https://software.intel.com/sites/landingpage/IntrinsicsGuide/ > > It shows there are intrinsics for integer modulo and normal division. > Search for keywords : 'rem' and 'div'. > But I'm skeptical about their performance as they have quite high latency > and unless we queue up quite a few instructions, the latencies would be > exposed. But, I'd like to test them anyway!! > <https://software.intel.com/sites/landingpage/IntrinsicsGuide/> Of course. My point was that gcc generates the same instructions anyway, without you having to use intrinsics. > The task is to build a hashtable which requires exactly 2 memory operations > per lookup. It is very easy to do a lookups that takes exactly 2 memory > accesses, hence no control divergence, hence a good fit for GPUs. Also the > space required for the hashtable is O(n). But building them is quite > intensive. Building a hash table for 10M hashes takes 30 sec while 100M > might take 10 minutes to build. Significant amount of the time is spent on > doing modulo operations, which cannot be done using simple & operators. "Exactly 2 memory lookups" and "no control divergence" may sound good, but it's not necessarily the best we can do. With a simple bitmap and hash table(s), if you make the bitmap large enough, you may have very close to 1 memory lookup on average. Now, the worst case is very high, and there's control divergence. However, this only means that some lanes will be execmask'ed out, and won't actually make their memory accesses (to the hash tables or linked list entries in hash buckets, if the corresponding bitmap bit was zero). With a large enough bitmap, you can probably have less than 2 memory accesses by the worst lane in a SIMD unit on average. This will run faster than your perfect hash table. Add to that the complexity and slowness of the hash function into your perfect table, and the difference becomes even higher in favor of a simple bitmap with a simple index function producing occasional collisions. So I don't expect your idea to work that well overall, even if you do get past the integer division hurdles. I'd expect the control divergence of a traditional approach to cause enough of a slowdown for your approach to become beneficial only if one or both of the following are true:  The SIMD vector is very wide (beyond what we have in current GPUs).  The amount of memory is insufficient for a reasonably sized bitmap for the given number of hashes loaded. Please feel free to try, though. I might be wrong. Alexander
Powered by blists  more mailing lists