Date: Wed, 19 Apr 2023 23:47:56 +0200
From: magnum <>
Subject: Re: Birthday paradox

On 2023-04-18 23:12, Solar Designer wrote:
> On Tue, Apr 18, 2023 at 10:31:04PM +0200, magnum wrote:
>> The BITMAP_SHIFT macro and the occasional multiplication are all
>> compile-time constants so should be optimized away. Then each bitmap
>> step is just an add, a couple shifts and a couple ANDs.
> So BITMAP_SHIFT makes this separate 8 bitmaps.  You could instead try
> removing the shifts and having each lookup cover the entire array
> (larger mask).  That would be a Bloom filter.  (...) You might very
> well achieve better results than we have now.

I read up on Bloom filters and tried this - indeed, very trivial changes 
turn it into a traditional k-level Bloom filter.  The performance (actual 
FPs counted by the kernel) barely changed at all (tried with both 
sparsely and heavily populated bitmaps), and where it ever so slightly 
did, it was sometimes in favor of the original code and sometimes of the 
modified one.  My conclusion is that they are very nearly equivalent, 
even more so than I presumed.
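For reference, here is a host-side toy model of the difference (plain C; 
the names, sizes and mixing function are mine, not the kernel's - just a 
sketch of the idea, not our actual code).  The "split" variant confines 
lookup i to its own sub-bitmap, as BITMAP_SHIFT does now; the Bloom 
variant lets every lookup address the whole array with a larger mask:

```c
#include <stdint.h>

#define BITMAP_BITS (1u << 20)            /* total bitmap size in bits (assumed) */
#define K 4                               /* number of lookups per hash (assumed) */
#define SUB_BITS (BITMAP_BITS / K)        /* per-lookup sub-bitmap, "split" style */

static uint32_t bitmap[BITMAP_BITS / 32];

/* derive K probe values from one hash word; a stand-in for the kernel's mixing */
static uint32_t idx(uint32_t h, int i)
{
	return h * 2654435761u + (uint32_t)i * 0x9e3779b9u;
}

static void set_bit(uint32_t bit)
{
	bitmap[bit >> 5] |= 1u << (bit & 31);
}

static int get_bit(uint32_t bit)
{
	return (bitmap[bit >> 5] >> (bit & 31)) & 1;
}

/* current style: lookup i is confined to sub-bitmap i */
static void insert_split(uint32_t h)
{
	for (int i = 0; i < K; i++)
		set_bit((uint32_t)i * SUB_BITS + idx(h, i) % SUB_BITS);
}

static int lookup_split(uint32_t h)
{
	for (int i = 0; i < K; i++)
		if (!get_bit((uint32_t)i * SUB_BITS + idx(h, i) % SUB_BITS))
			return 0;
	return 1;	/* maybe present (possible false positive) */
}

/* Bloom style: every lookup covers the entire array */
static void insert_bloom(uint32_t h)
{
	for (int i = 0; i < K; i++)
		set_bit(idx(h, i) % BITMAP_BITS);
}

static int lookup_bloom(uint32_t h)
{
	for (int i = 0; i < K; i++)
		if (!get_bit(idx(h, i) % BITMAP_BITS))
			return 0;
	return 1;	/* maybe present (possible false positive) */
}
```

With identical bitmap size and lookup count, the expected FP rates of the 
two layouts are nearly the same, which matches what I measured.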

> You could then experiment with lookup count, as it would be untangled
> from the size.

With my debug/research patches I could already experiment freely with 
sizes and lookup counts so this didn't really change anything.  Still, 
I'm leaning towards pushing those bitmap changes.  If nothing else, 
we'll know what to call our bitmaps from then on :)

>> I tried disabling the bitmap completely, to see
>> how bad it would be, and speed plummeted.  I've been meaning to re-try
>> that but with the offset and hash tables in local memory (when they fit
>> of course). But if I recall correctly, I even tried replacing the bitmap
>> lookup (for a single hash loaded) with a constant compare [if (hash[1] ==
>> 0xcafebabe)] and even *that* did not run any faster than the current
>> bitmap code. Not sure why really.

I now added code that checks if the offset + hash tables fit into local 
memory and if so, skips the bitmap completely.  This happens up to about 
5300 loaded hashes.  But just like most things I tried, this performs 
just the same as using the bitmap - neither better nor worse.
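In case it helps picture the direct path: a toy host-side model of what 
the kernel does when it skips the bitmap (again, names, sizes and the 
naive build step are mine - our real offset/hash tables are generated 
per-session on the host and the builder is smarter than this):

```c
#include <stdint.h>

/* toy sizes; small enough to fit local memory in the real case */
#define OFFSET_SIZE 16u
#define HASH_SIZE   32u

static uint32_t offset_table[OFFSET_SIZE];	/* per-bucket displacement */
static uint32_t hash_table[HASH_SIZE];		/* stores loaded hash words */
static uint8_t  used[HASH_SIZE];

/* naive builder: bump the bucket's displacement until the hash lands on a
 * free slot; note this can strand earlier entries of the same bucket, so
 * it only works collision-free in this toy - a real builder resolves that */
static int insert_direct(uint32_t h)
{
	uint32_t b = h % OFFSET_SIZE;
	for (uint32_t off = offset_table[b]; off < HASH_SIZE; off++) {
		uint32_t pos = (h + off) % HASH_SIZE;
		if (!used[pos]) {
			offset_table[b] = off;
			used[pos] = 1;
			hash_table[pos] = h;
			return 1;
		}
	}
	return 0;	/* table too full */
}

/* the lookup itself: one offset fetch, one table fetch, one exact compare -
 * no bitmap involved */
static int lookup_direct(uint32_t h)
{
	uint32_t pos = (h + offset_table[h % OFFSET_SIZE]) % HASH_SIZE;
	return used[pos] && hash_table[pos] == h;
}
```

When both tables sit in local memory, that lookup is about as cheap as the 
bitmap probe, so identical speed is perhaps not that surprising after all.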

> Maybe performance is dominated by random global memory
> access throughput (as in random lookups per second).

But our reads from global (as in the key buffer) should be amortized 
well by the mask acceleration.  Perhaps I should somehow re-verify that 
the MD4 is actually calculated entirely in registers, but I'd be a bit 
surprised if it's not.

Maybe I should retry much of the above on AMD, and see if the results 
are different.  Heck, I could even try it on this very MacBook I'm 
sitting at... I use it almost exclusively as a very expensive terminal 
emulator, lol.

