john-dev - Re: Birthday paradox

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <be8824a41601dfadd598aff54a79c08f@smtp.hushmail.com>
Date: Tue, 18 Apr 2023 22:31:04 +0200
From: magnum <magnumripper@...hmail.com>
To: john-dev@...ts.openwall.com
Subject: Re: Birthday paradox

On 2023-04-18 21:44, Solar Designer wrote:
> On Tue, Apr 18, 2023 at 07:29:36PM +0200, Solar Designer wrote:
>> On Tue, Apr 18, 2023 at 07:12:09PM +0200, magnum wrote:
>>> Background: I was amazed how an 4-level bitmap that "should" give 1/16
>>> false positives, could end up as good as 1/43.  That was with 512 hashes
>>> in a 4x1024 bitmap. Using this formula we end up with 1/41, so mystery
>>> solved.
>>>
>>> In another test case, 1024 hashes in a 8x1024 bitmap would be "full" and
>>> should give 1/1 false positives if we don't factor in the birthday
>>> paradox.  But empirical tests showed an amazing 1/38 - actually good
>>> enough to make an nvidia perform near its best!  This was again
>>> explained with the birthday paradox (which says 1/39).
>>
>> Now you need to compare efficiency of these multiple bitmaps against one
>> Bloom filter of their cumulative size.
> (...)
>> And then against a Cuckoo filter of the same size.
> 
> This is the best of both worlds: not only even lower FP rate per size
> than Bloom filter, but also at most 2 lookups.
> 
> However, perfect hash tables, which we already use as the final level in
> the code that you refer to, are an even further step in the same
> direction, needing exactly 1 lookup.  Given that we already went for the
> complexity there, is there a reason we're not doing the same for the
> smaller early-reject tables in local memory?  Just fill them with tiny
> fingerprints like a Cuckoo filter would use.

Assuming by "lookup" you mean the actual hash tables lookup, and by 
"early-reject" you mean the bitmap... Are you saing we should build an 
early-reject table more similar to the actual hash table?

I'm merely trying to understand all bits and pieces involved. The bitmap 
is always "a single lookup", but it can consist of up to eight reads of 
(hopefully local) memory:

---8<--------------8<--------------8<--------------8<-----------
     uint bitmap_index = hash[1] & BITMAP_MASK; 

     uint tmp = (bitmaps[BITMAP_SHIFT * 0 + (bitmap_index >> 5)] >> 
(bitmap_index & 31)) & 1U;
#if SELECT_CMP_STEPS == 2
     bitmap_index = hash[0] & BITMAP_MASK; 

     tmp &= (bitmaps[BITMAP_SHIFT * 1 + (bitmap_index >> 5)] >> 
(bitmap_index & 31)) & 1U;
#endif
#if SELECT_CMP_STEPS == 4
     (...)
---8<--------------8<--------------8<--------------8<-----------

The BITMAP_SHIFT macro and the occasional multiplication are all 
compile-time constants so should be optimized away. Then each bitmap 
step is just an add, a couple shifts and a couple ANDs.

Here's the code for a hash table look-up once we passed the bitmap test:

---8<--------------8<--------------8<--------------8<-----------
     uint t, hash_table_index; 

     ulong hash64; 

 

     hash64 = ((ulong)hash[1] << 32) | (ulong)hash[0]; 

     hash64 += (ulong)offset_table[hash64 % OFFSET_TABLE_SIZE]; 

     hash_table_index = hash64 % HASH_TABLE_SIZE; 

 

     if (hash_table[hash_table_index] == hash[0] && 

         hash_table[hash_table_index + HASH_TABLE_SIZE] == hash[1]) { 

             /* we've got a crack */
---8<--------------8<--------------8<--------------8<-----------

Like we discussed recently, the modulo operations are compile-time 
constants so should be fairly optimized - but perhaps they are still 
relatively expensive?  I tried disabling the bitmap completely, to see 
how bad it would be, and speed plummeted.  I've been meaning to re-try 
that but with the offset and hash tables in local memory (when they fit 
of course). But if I recall correctly, I even tried replacing the bitmap 
lookup (for a single hash loaded) with a constant compare [if (hash[1] = 
0xcafebabe)] and even *that* did not run any faster than the current 
bitmap code. Not sure why really.

Anyway I've mostly been tinkering with "many" hashes and how to optimize 
that. One idea I've had is for the case where eg. a 4x bitmap doesn't 
fit in local memory: Perhaps the first (or first two) steps can be in 
local, and the rest in global?

Well, many ideas, lots of basic research, slow progress.

magnum
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.