Date: Sat, 22 Apr 2023 23:39:44 +0200
From: Solar Designer <>
Subject: Re: Birthday paradox (Bloom filter, Perfect hash tables)

On Thu, Apr 20, 2023 at 02:15:48PM +0200, magnum wrote:
> I kind of regret having this conversation on the ML instead of Github - 
> the latter is way more convenient.

We had a chance of getting more people to comment in here, but somehow
that didn't happen.

> I found that Sayantan's perfect hash tables involve randomness - they 
> end up with slightly different sizes.  A Mersenne Twister is seeded from 
> gettimeofday().
> Perhaps differences in speed between runs could be caused by different 
> modulo figures ending up differently optimized?  Or should I expect 
> such optimizations to be more or less uniform, given that 
> OFFSET_TABLE_SIZE is (presumably) always a prime?

I'd expect them to be uniform for nearby values that are not powers of 2.
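For illustration of why that's expected: compilers typically turn modulo by
a constant divisor into a multiply-and-shift by a precomputed reciprocal, so
any non-power-of-2 divisor of similar magnitude compiles to about the same
instruction sequence, while a power of 2 collapses to a single AND.  A rough
Python sketch of the multiply-and-shift trick (illustration only, not JtR
code; 786433 is just an arbitrary prime of plausible magnitude):

```python
# Compute x % d without a division instruction, the way a compiler would
# for a constant divisor d: precompute a "magic" reciprocal m and shift s,
# then x // d == (x * m) >> s for all x in range.  The cost is the same
# for any non-power-of-2 d of similar size, including all nearby primes.

def magic(d, bits=32):
    """Return (m, s) such that (x * m) >> s == x // d for 0 <= x < 2**bits."""
    s = bits + (d - 1).bit_length()   # enough extra precision for the range
    m = (1 << s) // d + 1             # rounded-up reciprocal of d
    return m, s

def mod_by_magic(x, d, m, s):
    q = (x * m) >> s                  # quotient via multiply-and-shift
    return x - q * d                  # remainder

d = 786433                            # an arbitrary prime divisor
m, s = magic(d)
for x in (0, 1, d - 1, d, 123456789, 2**32 - 1):
    assert mod_by_magic(x, d, m, s) == x % d
```

So as long as OFFSET_TABLE_SIZE stays an odd prime, switching between nearby
values shouldn't change the cost of the modulo itself.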

> I suppose we could give it a static seed for uniform test runs.  Perhaps 
> we could even want to commit a static seed?  I assume we'd still need 
> the varying seed for when it has to retry a table build though.

I guess we should be able to use a static seed - if it's literally just
a seed, and retries then consume further random numbers computed from
that seed.

> Because of the mentioned randomness, 5308 hashes *sometimes* fit in local 
> memory without a bitmap, sometimes not.  But at that number, using local 
> memory or not almost doesn't affect the speed I get!  Perhaps the freed 
> local memory becomes cache (or caching is otherwise very good) - I 
> recently heard someone mention that about modern GPUs.

L1d caches are of similar size to local memory, may be of similar speed
too, and if there are no global memory accesses they're not thrashed.
However, there may be differences in gather load support - I'd expect it
to favor local memory at least on AMD.  When you make those tiny random
lookups concurrently from different SIMT "threads", it's essentially a
gather load instruction if supported for the memory type.

By the way, FP rates like 1/46 might not be good enough if any _one_ of
64 "threads" not rejecting during those early lookups means that
execution has to proceed to the global memory hash table access, and all
will wait for it to complete.  Apparently on newer AMD RDNA the
wavefront width is reduced from 64 to 32 - that's an improvement.
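To put rough numbers on that: with a per-thread FP rate of 1/46, the whole
wavefront skips the global-memory lookup only if every lane rejects early,
which happens with probability (1 - 1/46)^width.  That works out to roughly
three quarters of 64-wide waves still taking the global-memory path, versus
about half of 32-wide waves.  A quick back-of-the-envelope check:

```python
# Probability that a whole wavefront stalls on the global-memory hash
# table lookup, given a per-lane false-positive rate p: the wave avoids
# the lookup only when all lanes reject during the early bitmap checks.
p = 1 / 46
for width in (64, 32):                # GCN wave64 vs RDNA wave32
    all_reject = (1 - p) ** width     # every lane rejects early
    stall = 1 - all_reject            # at least one lane proceeds
    print(f"width {width}: {stall:.1%} of waves still hit global memory")
```

So even a seemingly low FP rate gets amplified by the wavefront width, which
is why wave32 helps here.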
