Date: Sun, 21 Jul 2013 03:24:54 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: md5 hash comparisons

Sayantan,

On Thu, Jul 18, 2013 at 12:03 AM, Sayantan Datta <std2048@...il.com> wrote:
> I saw the bitmap code; the fundamental idea seems to be translating a
> hash value into a bit position.  Instead of searching for exact values,
> we do a lookup for a bit value in the bitmap.  It seems to reduce memory
> use a lot, which is really great.

We should use an optimal combination of possibly multiple small bitmaps
(similar in concept to but not exactly a Bloom filter), which fit in
local memory, and possibly a large bitmap in global memory.  Their
number and sizes should be such that the false positive rate after all
bitmaps have been accessed is very low.  In fact, we need to have a
reasonable early-reject code path before going to the global memory
bitmap (if we use one).  It should be reasonable in the sense that most
of the time the condition is the same for enough work-items that we can
afford to have this conditional branch on a GPU.  We may also have a
similarly reasonable even-earlier-reject code path in between some local
memory bitmap checks, which may let us compute a smaller portion of the
hash value most of the time (IIRC, myrice's PG-test code has this).
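
To make this concrete, here's a rough OpenCL-style sketch of the kind of
per-candidate check I have in mind (the names, bitmap sizes, and choice
of hash bits below are made up purely for illustration; the real
parameters would have to be tuned):

    /* Two hypothetical 8 KiB local memory bitmaps (65536 bits each),
     * indexed by different 16-bit portions of the first MD5 word. */
    uint a = hash0;                 /* first 32 output bits */
    uint i1 = a & 0xffff;
    uint i2 = a >> 16;
    uint hit = (local_bitmap1[i1 >> 5] >> (i1 & 31)) & 1;
    hit &= (local_bitmap2[i2 >> 5] >> (i2 & 31)) & 1;

    if (hit) {
        /* Rare path: only now compute/load more of the hash and
         * consult a (hypothetical) bitmap in global memory. */
        uint b = hash1;
        uint i3 = b & GLOBAL_BITMAP_MASK;
        hit = (global_bitmap[i3 >> 5] >> (i3 & 31)) & 1;
    }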

When we get a hit per all bitmaps, we proceed (still on GPU) to check
the actual hash value, such as via a hash table in global memory.  If we
still get a hit on that, we say so to the host, and it'd need to check if the
full hash also matches (the GPU only has partial hashes, although the
bitmaps may/should be pre-filled with data from larger and possibly even
from full hashes - so that we can do the multiple lookups per hash).
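
Purely as an illustration of that step (the hash table layout and the
result-reporting scheme here are made up; the real design is open):

    /* After all bitmaps report a hit: probe a hypothetical hash table in
     * global memory that maps a partial hash to loaded-hash indices. */
    if (hit) {
        uint bucket = hash0 & HASH_TABLE_MASK;
        for (uint j = bucket_start[bucket]; j < bucket_end[bucket]; j++)
            if (loaded_partial_hashes[j] == hash0)
                /* Report (work-item, j) to the host, which then
                 * re-checks the full hash - the GPU only has a
                 * partial hash to compare against. */
                result[get_global_id(0)] = j + 1;
    }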

Please re-read the messages from last year posted by me and myrice in
here.  We also had IRC discussions with Bitweasil at the time, who said
that his Multiforcer uses the approach with multiple bitmaps - IIRC,
it's at least one bitmap in local memory and four 128 MiB bitmaps in
global memory.

myrice tried implementing this (do take a look at the PG-test branch)
and somehow came to the conclusion that 4 small bitmaps in local memory
worked well enough (each indexed by a different portion of MD5 hash
value), without needing a global memory bitmap.  Maybe a reason why
myrice got away with local memory bitmaps only (no bitmap in global
memory) is because he also implemented a hash table lookup (using global
memory), whereas, if I recall Bitweasil's explanation correctly,
Multiforcer proceeds with a linear search if the 4*128 MiB global memory
bitmaps give a hit.  Yet I think we do need bitmap(s) in global memory
as well (see below for some numbers).

On Thu, Jul 18, 2013 at 12:27:18AM +0530, Sayantan Datta wrote:
> But does this really help filtering out most hashes?  By most hashes I
> mean at least more than 90%, otherwise we may not see the desired
> performance benefits due to extra branching and extra code.

"More than 90%" is not good enough.  It should be multiple 9's.

> Also 64K loaded hashes seem to fit well within local memory using bitmaps

No, even 64K is too much to fit in one local memory bitmap (e.g., you'd
achieve a 78% reject rate with a single 32 KiB bitmap).  The solution
is in use of multiple bitmaps, indexed by different portions of the hash
value.  (Alternatively, you could use a Bloom filter - that is, just one
bitmap indexed multiple times by the different portions of hash value -
but it would be suboptimal when its size and the number of lookups are
fixed rather than tuned based on number of loaded hashes, and besides we
care about the number of lookups too - not only about the false positive
rate and the size.)
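
(For reference, that 78% follows from the usual occupancy estimate: n
hashes setting bits in an m-bit bitmap leave roughly

    1 - (1 - 1/m)^n  ~=  1 - e^(-n/m)

of the bits set.  With m = 32 KiB = 262144 bits and n = 65536, that's
1 - e^(-1/4) ~= 22%, so a random non-matching hash is rejected about 78%
of the time.)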

> but 16M won't fit into local memory even with bitmaps.  However, chances
> of rejection are higher with a 16M bitmap.

Yes, for 16M you need global memory - even the multiple bitmaps or Bloom
filter approach in local memory doesn't work at all.  (Maybe we should
include a not-data-dependent conditional branch to skip the local
bitmaps and go straight to global memory when the number of loaded
hashes is this large.  We'd need to tune the threshold.)
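
One way to keep that branch not-data-dependent is to make the decision at
kernel build time rather than per work-item - something like the sketch
below, where the macro and helper names are hypothetical and the cutoff
would have to be found by benchmarking:

    #if NUM_LOADED_HASHES > LOCAL_BITMAPS_CUTOFF
        hit = global_bitmap_lookup(hash0);    /* skip local bitmaps */
    #else
        hit = local_bitmaps_lookup(hash0);
        if (hit)
            hit = global_bitmap_lookup(hash0);
    #endif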

However, if you consider e.g. 64K, then with four 8 KiB local memory
bitmaps (32 KiB total) we get an 84% or 85% reject rate (depending on
whether your 64K is 65536 or 64000).  This is an improvement over 78%,
albeit not a great improvement.  For the remaining 15%+ you need to
access a global memory bitmap.
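
(Same estimate as above: each 8 KiB bitmap is 65536 bits, so with 65536
hashes about 1 - e^(-1) ~= 63% of its bits end up set; four lookups into
independently-indexed bitmaps pass a non-matching hash with probability
about 0.63^4 ~= 0.16, hence the ~84% reject rate.  With 64000 hashes the
bitmaps are slightly sparser, which gives roughly 85%.)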

Since an entire SIMD unit will have to execute the same code path, the
actual efficiency will be a lot lower than these per-hash reject rates
suggest, yet this would measurably reduce
the number of global memory accesses as compared to going to global
memory unconditionally.  Even though most of the ALUs for SIMD vector
elements would be wasted while global memory access is performed for a
few of them, not performing such access from these wasted ALUs may
keep more of the shared resources (buses, caches) available for similar
global memory accesses by other SIMD units.

With lower numbers of loaded hashes (just a few thousand or less), the
multiple bitmaps are a lot more effective.

Alexander
