Date: Tue, 18 Apr 2023 23:12:31 +0200
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: Birthday paradox

On Tue, Apr 18, 2023 at 10:31:04PM +0200, magnum wrote:
> On 2023-04-18 21:44, Solar Designer wrote:
> >However, perfect hash tables, which we already use as the final level in
> >the code that you refer to, are an even further step in the same
> >direction, needing exactly 1 lookup.  Given that we already went for the
> >complexity there, is there a reason we're not doing the same for the
> >smaller early-reject tables in local memory?  Just fill them with tiny
> >fingerprints like a Cuckoo filter would use.
> 
> Assuming by "lookup" you mean the actual hash tables lookup, and by 
> "early-reject" you mean the bitmap...

By "lookup" I mean one random access to whatever memory, for whatever
purpose.  However, for the above paragraph your way of reading it is
also right.

> Are you saying we should build an 
> early-reject table more similar to the actual hash table?

Yes, I am saying we could try either that or at least a cuckoo filter.
I'm very familiar with cuckoo filters now, but not yet with perfect
hash tables.  So a cuckoo filter would be easier for me, but we also
already have perfect hash tables due to Sayantan's contribution and
could just try to duplicate their usage at this earlier step, with
small fingerprints (as a cuckoo filter would use) instead of full
hashes.  The difference is 1 lookup (perfect) vs. up to 2 lookups
(cuckoo) per early-reject.
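
For concreteness, here's a minimal sketch of what such a
cuckoo-filter-style early-reject probe could look like on the device.
The cuckoo[] array, NUM_BUCKETS (a power of two), and the fingerprint
derivation are made-up placeholders, not anything we have in the tree:

/*
 * Early-reject in the style of a cuckoo filter: up to two local memory
 * reads, each bucket holding four 8-bit fingerprints packed into a uint.
 */
inline uint cuckoo_maybe_present(__local const uint *cuckoo, uint h0, uint h1)
{
	uint f = h1 & 0xff;	/* 8-bit fingerprint */
	f += !f;		/* 0 marks an empty slot, so avoid it */
	uint i1 = h0 & (NUM_BUCKETS - 1);
	uint i2 = (i1 ^ (f * 0x5bd1e995)) & (NUM_BUCKETS - 1); /* partial-key cuckoo */

	uint b = cuckoo[i1];
	if ((b & 0xff) == f || ((b >> 8) & 0xff) == f ||
	    ((b >> 16) & 0xff) == f || (b >> 24) == f)
		return 1;
	b = cuckoo[i2];
	return (b & 0xff) == f || ((b >> 8) & 0xff) == f ||
	    ((b >> 16) & 0xff) == f || (b >> 24) == f;
}

Construction on host would insert each loaded hash's fingerprint into
whichever of its two buckets has a free slot, relocating entries as
usual for cuckoo hashing.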

I wonder, would mixing the two concepts be original research?  Similar
to how first there was cuckoo hashing, and later the cuckoo filter.
I'm not aware of anyone having taken a similar step from perfect
hashing to, let's call it, a perfect filter yet, even though it seems
obvious.
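
If we went that "perfect filter" route, the probe would differ only in
how the slot index is derived: reuse the offset-table construction we
already have, but store a byte-sized fingerprint per slot rather than
the full hash, so the whole thing can fit in local memory.  A hedged
sketch, where PF_OFFSET_SIZE, PF_TABLE_SIZE, and both arrays are
hypothetical smaller-scale counterparts of what Sayantan's code builds:

inline uint pf_maybe_present(__local const ushort *pf_offsets,
                             __local const uchar *pf_fingerprints,
                             uint h0, uint h1)
{
	ulong h64 = ((ulong)h1 << 32) | (ulong)h0;
	h64 += (ulong)pf_offsets[h64 % PF_OFFSET_SIZE];
	uchar f = (uchar)(h1 >> 24);	/* ideally bits not consumed by the index */
	f += !f;			/* keep 0 as the empty-slot marker */
	return pf_fingerprints[h64 % PF_TABLE_SIZE] == f;
}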

> I'm merely trying to understand all bits and pieces involved. The bitmap 
> is always "a single lookup", but it can consist of up to eight reads of 
> (hopefully local) memory:

I'd call that 8 lookups.

> ---8<--------------8<--------------8<--------------8<-----------
>     uint bitmap_index = hash[1] & BITMAP_MASK;
>     uint tmp = (bitmaps[BITMAP_SHIFT * 0 + (bitmap_index >> 5)] >>
>                 (bitmap_index & 31)) & 1U;
> #if SELECT_CMP_STEPS == 2
>     bitmap_index = hash[0] & BITMAP_MASK;
>     tmp &= (bitmaps[BITMAP_SHIFT * 1 + (bitmap_index >> 5)] >>
>             (bitmap_index & 31)) & 1U;
> #endif
> #if SELECT_CMP_STEPS == 4
>     (...)
> ---8<--------------8<--------------8<--------------8<-----------
> 
> The BITMAP_SHIFT macro and the occasional multiplication are all 
> compile-time constants so should be optimized away. Then each bitmap 
> step is just an add, a couple shifts and a couple ANDs.

So BITMAP_SHIFT makes this 8 separate bitmaps.  You could instead try
removing the shifts and having each lookup cover the entire array
(larger mask).  That would be a Bloom filter.  You could then
experiment with the lookup count, as it would be untangled from the
size.  Of course, all of this would need to be consistently replicated
during bitmap construction on host.  You might very well achieve better
results than we have now.  Instead of experimenting, and assuming that
local memory is of uniform access speed, you could simply use a Bloom
filter parameters vs. FP rate calculator (such JavaScript web pages
exist) to simulate this.  Those web page calculators might not account
for the actual average lookup count, though, so you might need to write
your own simulator, which should be quick and easy.
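
To make that concrete, a hedged sketch of the device-side probe for
such a combined filter - K, BLOOM_MASK, and the double-hashing index
derivation are placeholders to be tuned, and the host-side construction
would have to set exactly the same bits:

inline uint bloom_maybe_present(__local const uint *bloom, uint h0, uint h1)
{
	uint hit = 1;
	for (uint i = 0; i < K; i++) {
		/* Kirsch-Mitzenmacher double hashing for the i-th bit index */
		uint idx = (h0 + i * h1) & BLOOM_MASK;
		hit &= (bloom[idx >> 5] >> (idx & 31)) & 1U;
	}
	return hit;
}

An early-out on !hit would bring the average lookup count below K for
the non-matching (vast majority of) candidates.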

However, knowing that cuckoo and perfect are more efficient, the above
could be a waste of effort unless you expect to stay with it for a long
time (such as for lack of R&D time to proceed further).

> Here's the code for a hash table look-up once we passed the bitmap test:
> 
> ---8<--------------8<--------------8<--------------8<-----------
>     uint t, hash_table_index;
>     ulong hash64;
>
>     hash64 = ((ulong)hash[1] << 32) | (ulong)hash[0];
>     hash64 += (ulong)offset_table[hash64 % OFFSET_TABLE_SIZE];
>     hash_table_index = hash64 % HASH_TABLE_SIZE;
>
>     if (hash_table[hash_table_index] == hash[0] &&
>         hash_table[hash_table_index + HASH_TABLE_SIZE] == hash[1]) {
>             /* we've got a crack */
> ---8<--------------8<--------------8<--------------8<-----------
> 
> Like we discussed recently, the modulo operations use compile-time 
> constant divisors, so they should be fairly well optimized - but 
> perhaps they are still relatively expensive?

If optimized by the compiler, their latency is probably that of two
consecutive integer multiplies plus a little more.  We could help the
compiler optimize further (removing that "a little more" part) by
computing the reciprocals explicitly, targeting the 31-bit range
if/when that's sufficient for us.  It costs a few more operations to
support the full 32-bit range, as I figured out and explained in
comments on that recent GitHub PR.  However, the performance difference
is probably small.
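
For reference, one way to spell the reciprocal out explicitly is with a
single conditional correction, which then happens to cover the full
32-bit input range - a sketch only, not necessarily the sequence the
compiler emits.  D is the table size and RECIP_D would be precomputed
on host as (uint)(0x100000000ULL / D); the compare-and-subtract at the
end is the sort of extra work the full range costs:

inline uint mod_const(uint x, uint D, uint RECIP_D)
{
	uint q = mul_hi(x, RECIP_D);	/* floor(x / D) or one less */
	uint r = x - q * D;
	if (r >= D)			/* at most one correction is ever needed */
		r -= D;
	return r;
}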

We could also verify whether the compiler optimizes this.  One way
would be to look at the ISA.  Another would be to use a runtime
variable for the divisor and see whether and by how much it would be
slower - in the full program and/or in an isolated test case.

Oh!  I just realized that we're dividing a 64-bit number here.  My
tests and reasoning above were for 32-bit to 32-bit (or to 31-bit).  It
would be more costly for 64-bit, although conceptually the same kind of
optimizations would apply (needing 128-bit or maybe 96-bit
intermediates, which is probably still faster than a division).

OTOH, if the input were 32-bit, then the modulo division would
introduce significant biases (buckets with index below 2^32 mod N would
each receive one more of the possible inputs than the rest).  This
could be fine for a perfect hash table (not sure, though), but it'd
certainly hurt the efficiency of a cuckoo filter (assuming uniform
memory access speed; with caching, not so obvious).
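
To put a number on it, here's a tiny host-side illustration of that
skew - the example N is arbitrary, only chosen large enough that the
effect is clearly visible:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	/* Buckets below 2^32 % N receive ceil(2^32 / N) of the possible
	   32-bit inputs, the rest floor(2^32 / N). */
	uint64_t N = 3000000000ULL;	/* arbitrary example table size */
	uint64_t span = 1ULL << 32;
	uint64_t lo = span / N, hi = lo + (span % N != 0);
	printf("N=%llu: %llu vs. %llu inputs per bucket (%.2fx skew)\n",
	       (unsigned long long)N, (unsigned long long)hi,
	       (unsigned long long)lo, (double)hi / (double)lo);
	return 0;
}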

> I tried disabling the bitmap completely, to see 
> how bad it would be, and speed plummeted.  I've been meaning to re-try 
> that but with the offset and hash tables in local memory (when they fit 
> of course).  But if I recall correctly, I even tried replacing the bitmap 
> lookup (for a single hash loaded) with a constant compare [if (hash[1] == 
> 0xcafebabe)] and even *that* did not run any faster than the current 
> bitmap code.  Not sure why, really.

That's puzzling.  Maybe performance is dominated by random global memory
access throughput (as in random lookups per second).

> Anyway I've mostly been tinkering with "many" hashes and how to optimize 
> that.  One idea I've had is for the case where e.g. a 4x bitmap doesn't 
> fit in local memory: perhaps the first (or first two) steps can be in 
> local, and the rest in global?

If global memory were not cached, this would make no sense - it'd be
better to do the last-level hash table lookup right away.  However,
with caching it could be different - you'd move those steps not from
local to global, but from local to (hopefully mostly) cache, and it
could theoretically run faster on some devices.
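
A hedged sketch of how that split could look in the kernel - the buffer
names, the early-out, and putting exactly one step in local memory are
all arbitrary choices here:

inline uint bitmap_check_split(__local const uint *bmp_local,
                               __global const uint *bmp_global,
                               uint h0, uint h1)
{
	uint idx = h1 & BITMAP_MASK;
	uint tmp = (bmp_local[idx >> 5] >> (idx & 31)) & 1U;
#if SELECT_CMP_STEPS >= 2
	if (tmp) {	/* only touch global memory if the local step passed */
		idx = h0 & BITMAP_MASK;
		tmp = (bmp_global[BITMAP_SHIFT + (idx >> 5)] >> (idx & 31)) & 1U;
	}
#endif
	return tmp;
}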

> Well, many ideas, lots of basic research, slow progress.

The most obvious and easiest change relative to what we have is to try
combining those bitmaps into one Bloom filter.  This isn't expected to
be the optimal long-term solution, but if it turns out to be faster
than what we have now, it'd provide a new baseline for further
attempts.
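
Host-side construction for that combined filter is the easy part; a
sketch matching the probe above, with the array and count names
standing in for whatever the format actually passes around:

#include <stdint.h>

static void bloom_build(uint32_t *bloom, const uint32_t *loaded_hashes,
                        uint32_t num_loaded_hashes)
{
	/* bloom[] is (BLOOM_MASK + 1) / 32 uints, zeroed by the caller;
	   set the same K bits per loaded hash that the kernel will test */
	for (uint32_t i = 0; i < num_loaded_hashes; i++) {
		uint32_t h0 = loaded_hashes[2 * i];
		uint32_t h1 = loaded_hashes[2 * i + 1];
		for (uint32_t j = 0; j < K; j++) {
			uint32_t idx = (h0 + j * h1) & BLOOM_MASK;
			bloom[idx >> 5] |= 1U << (idx & 31);
		}
	}
}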

Alexander
