Date: Sun, 21 Jul 2013 03:24:54 +0400 From: Solar Designer <solar@...nwall.com> To: john-dev@...ts.openwall.com Subject: Re: md5 hash comparisons Sayantan, On Thu, Jul 18, 2013 at 12:03 AM, Sayantan Datta <std2048@...il.com> wrote: > I saw the bitmap code, fundamental idea seems to be translation of a bit > based on hash value. Instead of searching exact values, we do a lookup for > a bit value in the bitmap. It seems to reduce memory use a lot which is > really great. We should use an optimal combination of possibly multiple small bitmaps (similar in concept to but not exactly a Bloom filter), which fit in local memory, and possibly a large bitmap in global memory. Their number and sizes should be such that the false positive rate after all bitmaps have been accessed is very low. In fact, we need to have a reasonable early-reject code path before going to the global memory bitmap (if we use one). It should be reasonable in the sense that most of the time the condition is the same for enough work-items that we can afford having this conditional branch on GPU. We may also have a similarly reasonable even-earlier-reject code path inbetween some local memory bitmap checks, which may let us compute a smaller portion of the hash value most of the time (IIRC, myrice's PG-test code has this). When we get a hit per all bitmaps, we proceed (still on GPU) to check the actual hash value, such as via a hash table in global memory. If we still get a hit on that, we say so to host and it'd need to check if the full hash also matches (the GPU only has partial hashes, although the bitmaps may/should be pre-filled with data from larger and possibly even from full hashes - so that we can do the multiple lookups per hash). Please re-read the messages from last year posted by me and myrice in here. We also had IRC discussions with Bitweasil at the time, who said that his Multiforcer uses the approach with multiple bitmaps - IIRC, it's at least one bitmap in local memory and four 128 MiB bitmaps in global memory. myrice tried implementing this (do take a look at the PG-test branch) and somehow came to the conclusion that 4 small bitmaps in local memory worked well enough (each indexed by a different portion of MD5 hash value), without needing a global memory bitmap. Maybe a reason why myrice got away with local memory bitmaps only (no bitmap in global memory) is because he also implemented a hash table lookup (using global memory), whereas, if I recall Bitweasil's explanation correctly, Multiforcer proceeds with a linear search if the 4*128 MiB global memory bitmaps give a hit. Yet I think we do need bitmap(s) in global memory as well (see below for some numbers). On Thu, Jul 18, 2013 at 12:27:18AM +0530, Sayantan Datta wrote: > But does this really help filtering out most hashes ? By most hahses I mean > at least more than 90% otherwise we may not see desired > performance benefits due to extra branching and extra code. "More than 90%" is not good enough. It should be multiple 9's. > Also 64K loaded hahses seems to fit well within local memory using bitmaps No, even 64K is too much to fit in one local memory bitmap (e.g., you'd achieve a 78% reject rate with a single 32 KiB bitmap). The solution is in use of multiple bitmaps, indexed by different portions of the hash value. (Alternatively, you could use a Bloom filter - that is, just one bitmap indexed multiple times by the different portions of hash value - but it would be suboptimal when its size and the number of lookups are fixed rather than tuned based on number of loaded hashes, and besides we care about the number of lookups too - not only about the false positive rate and the size.) > but 16M wont fit into local memory even with bitmaps. However chances of > rejection is higher with 16M bitmap. Yes, for 16M you need global memory - even the multiple bitmaps or Bloom filter approach in local memory doesn't work at all. (Maybe we should include a not-data-dependent conditional branch to skip the local bitmaps and go straight to global memory when the number of loaded hashes is this large. We'd need to tune the threshold.) However, if you consider e.g. 64K, then with four 8 KiB local memory bitmaps (32 KiB total) we get an 84% or 85% reject rate (depending on whether your 64K is 65536 or 64000). This is an improvement over 78%, albeit not a great improvement. For the remaining 15%+ you need to access a global memory bitmap. Since an entire SIMD unit will have to execute the same code path, the actual efficiency will be a lot less, yet this would measurably reduce the number of global memory accesses as compared to going to global memory unconditionally. Even though most of the ALUs for SIMD vector elements would be wasted while global memory access is performed for a few of them, not performing such access from these wasted ALUs may keep more of the shared resources (buses, caches) available for similar global memory accesses by other SIMD units. With lower numbers of loaded hashes (just a few thousand or less), the multiple bitmaps are a lot more effective. Alexander
Powered by blists - more mailing lists
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.