Date: Sat, 21 Jan 2012 22:40:35 +0400 From: Solar Designer <solar@...nwall.com> To: john-dev@...ts.openwall.com Subject: dropping *_hash_0() magnum, Jim, all - Right now, the *_hash*() functions provided by each format are supposed to return values of these bit sizes: 4, 8, 12, 16, 20, 24, 27 The smallest one of these (4-bit) was relevant in 1990s when RAM sizes were small and John only supported salted hashes. So this helped to check a computed hash value against a few loaded hashes (with that one salt) quicker while not consuming much RAM. Things have changed. The last straw may be my recent introduction of bitmaps, where a bitmap of 16 bits is ridiculous (it is smaller than one element in the array of "unsigned int") and where hash tables (used after bitmap lookups) are even smaller (so far, based on some testing, I set them to have 4 times fewer elements than the bitmaps). A 4-element hash table makes little sense - if we use a hash table now, we can as well make it larger. On the other hand, 27-bit hash values were meant to be sufficient for up to about 100 million of same-salt hashes loaded. Larger values would be potentially problematic. 27 bits corresponded to 1 GB hash table size with 64-bit pointers. The next size would be 2 GB, and I suspect that it could hit limitations of some malloc() implementations. With my introduction of bitmaps, hash tables became smaller - 4 times smaller now. With 100M hashes loaded, we'd get a 27-bit bitmap, but only a 25-bit hash table. The bitmap would be approx. 52.5% full (assuming uniform distribution). This means that more than 50% of the time the smaller hash table would be accessed, and it'd take about 5 iterations to see that the computed hash does not actually match any of those in that hash bucket. This seems suboptimal. A workaround could be to make the bitmap vs. hash table size ratio variable (have it depend on hash count, so that e.g. for 100M hashes we'd have both the bitmap and the hash table use full 27-bit indices), but this may slow the code down a little bit (in all cases, not only when this flexibility is actually needed - unless we use specialized pieces of code). Thus, I propose that we drop 4-bit and add 29-bit hash values (for bitmaps; this corresponds to 27-bit for hash tables). The maximum hash table size will then remain at 1 GB (per salt), although the corresponding threshold would be higher than we'd set it for direct hash tables (without bitmaps). (29 also happens to be the number of significant bits in the first 32-bit half of binary tripcodes with the current code.) This lets us reuse the same sets of 7 hash function pointers that we already have. They will correspond to these return value sizes: 8, 12, 16, 20, 24, 27, 29 I propose that we don't rename any functions. Simply drop *_0() and add *_7(), so the new numbering will start with 1 (although array indices obviously start at 0). Similarly, drop PASSWORD_HASH_SIZE_0 and PASSWORD_HASH_THRESHOLD_0, and add PASSWORD_HASH_SIZE_7 and PASSWORD_HASH_THRESHOLD_7 in params.h. Any comments? If a format is not updated, it will end up wasting memory. To deal with this, we may introduce a new FMT_* flag that will indicate the use of the new sizes - but this flag will soon be set on all formats. Do we need this complexity? Alexander
Powered by blists - more mailing lists
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.