john-dev - Re: Birthday paradox

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20230425151552.GA29035@openwall.com>
Date: Tue, 25 Apr 2023 17:15:52 +0200
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: Birthday paradox

On Tue, Apr 25, 2023 at 04:24:19PM +0200, magnum wrote:
> On 2023-04-23 13:08, Solar Designer wrote:
> >I think we should switch to using floating-point inside that function so
> >that we do not have to explicitly lower precision (...)
> This (fixed now in bleeding) helped a lot.  Here's a few data points 
> (this all on nvidia 2080ti):

Great.

> - Probing the table directly is faster than, or about as fast as, using 
> the bitmap - up to about 500 hashes.  For a single (1) hash, it's 4-5% 
> faster than best seen with bitmap - even when the tables are in global 
> memory (nearly same performance, caches doing their job).

My expectation, from bcrypt performance, is that you'll find a greater
difference between memory types on AMD.

> At about 5K hashes, probing directly gets significantly slower than the 
> bitmap, when the tables no longer fit in local memory.  Above that, the 
> bitmap clearly wins, even where the bitmap no longer fits in local and 
> even with poor parameter choices (high FP rates).
> 
> Even in the upper end of the spectrum (10M hashes in this case), the 
> tables were largely outperformed by nearly any bitmap, even with 1/15 fp 
> rate.  I'm a bit surprised by this, as probing directly is less work and 
> fewer loads from global - the cache situation should be horrible in both 
> cases.

Yes, that's surprising.

> - I've seen at least one case where k=3 is much better than the old 
> bitmaps:  At 1.5M hashes I got 10891Mp/s with m=16777216 and k=3 whereas 
> bleeding-jumbo produced 6686Mp/s using m=8388608 and k=1.
> k=5 or 6 was tried but not better than others. k=7 or 8 was never picked 
> by my (varying) algorithms so far.
> 
> - I have yet to see the old bitmap code outperform the new in any 
> discouraging way.  It sometimes wins, but not by far and I'm nowhere 
> near finishing my "algo".  Endless testing ahead.

Cool.  Meanwhile, I recalled a drawback of Bloom filter vs. separate
bitmaps: you can't remove items from a Bloom filter on its own (without
host memory overhead to store extra info to allow for that, or
regenerating).  I guess the current code didn't bother removing from the
separate bitmaps anyway?  Ideally, we'd enhance the code to do that so
that c/s and p/s (but of course not C/s) would improve as hashes get
cracked and removed.  In the single lookup bitmaps that we use on host
in cracker.c for CPU formats, we do remove cracked hashes from there.

Given that our Bloom filters are intended for local memory and are thus
tiny in terms of host memory, you can easily also store them as separate
bitmaps on host to allow for efficient item removal.  Just update the
affected elements in the individual bitmaps, then regenerate the
corresponding elements in the Bloom filter (not the full filter) from
the bitmaps.

You could also want to experiment with smaller hashes aka fingerprints
stored in the perfect hash tables in local memory, so that more would
fit.  This is basically how cuckoo filter differs from cuckoo hashing.
And you can easily and quickly remove items from there (leaving a hole,
so it's not as optimal as a smaller new table/filter could be, but is
better than not removing).

By smaller hashes to use as fingerprints, I mean even down to 8-bit or
less if you also use the full tables in global memory as a second step.
With a typical (4, 2) cuckoo filter, 8-bit would mean about 1/32, which
is comparable to what you say you're often getting with a Bloom filter
now.  With perfect hash tables, it should be better yet - even as good
as about 1/256 maybe?

I write "about" because you also need to mark/handle empty slots, which
in the trivial implementation wastes one magic fingerprint value.(*)
When a truncated hash happens to match this value, you have to store a
different value as the fingerprint instead, and that different value
unfortunately becomes twice more frequent than the rest, which lowers
efficiency slightly.  Another reason for "about" is that a cuckoo filter
wouldn't be 100% full - optimal is 98% full for (4, 2) - which on one
hand lowers the FP rate, but on the other means there's size overhead.

(*) Something I managed to avoid in pwqfilter with much more complex
trickery that is not applicable for tiny early-reject filters that we
need here.

Alexander
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.