Date: Sat, 26 Mar 2011 09:55:22 -0500 From: "jfoug" <jfoug@....net> To: <john-dev@...ts.openwall.com> Subject: RE: john scalability >From: magnum > >Someone please explain briefly what those different size tables are >about. What happens without this patch, that make such a huge >performance hit? How come it affects inital loading too? I looked at the >code and found sizes and thresholds but I didn't get any wiser. There are several (many actually) places where hash tables are needed. This cuts down lookup times. If you have 500 candidates, and your hash function is random (enough), and you have 5 hash buckets, then on 'average' there will be 10 items in each hash bucket, and no average, you will look through 1/2 of them for success (which may be rare), and look through all of them on non-success (often very common). This makes for checking each value to take quite a lot of time (tesing against 10 candidates on average). However, the memory usage is small (only 5 hash buckets). Now, if you had the same 500 candidates, but had 50 million hash buckets, and the hashing function again was random, then it would be very likely that no bucket contained more than 1 hash value, and almost ALL of them contain zero. This makes for much quicker searching. However, in this case, you waste a BUNCH of memory. During load time in john, there are some hash tables build. These may be driven by data in the params.h file. Solar may have to fill in some of the details, I am sort of sketchy here. I know that if you do NOT have a proper salt_hash function built in your format, then loading can get very slow. This function is used by the loader, and not during the run of john (I think). During the run of john, there is a hash table used to check computed hashes against possible candidates. If we had 1 million candidates we would NOT want to check our computed hash value against all 1 million of them. We would want to as quickly as possible narrow that search down to none, or just a few candidates. This is where the hashing comes in. When the loader loads the cipher file, all of the want to find hashes are there. The loader then builds a hash table using those values. Then when a candidate is run, the result is put through the same hash function, and only the wanted values which match that hash (if any) are looked at. To do this, (as john is laid out), there are an array of hash generation functions in the format (binary_hash) which compute hashes found in the cipher text file (used by loader), and there is an array of functions which are used at run time (get_hash) which john calls after the crypt_all. The hash generated by each of these 'pair' of functions MUST be the same, for the same password. The reason for a pair of them, ist aht binary_hash can be written to do things like ascii/hex/base64 to binary conversion, etc, while the get_hash can be written to perform no conversion, but simply use the binary data (thus run much faster). The speed of binary_hash is of little consequence (within reason). The speed of get_hash should be fast, since it will be called billions of times. Now, when john starts, it determines how many unique candidates it has. It then computes which of the 5 sized hash table can be used. NOTE, it may be constrained by a format only having 3 hash table sizes that was the original number of hashing functions, before I 'optionally' expanded this to 5 a couple years back. There is a table in params.h which specifies when to bump up to the next table size (based on candidate count). There used to be 3 hash tables. I found serious slow-downs that started at a couple thousand candidates, and went totally to pot when doing 350k or so. I bumped that up to 5, and Solar has since put that into the jumbo. However, at the time, I did not know how to get the proper hash values for some of the formats. Thus, I made the hash size selection code smart enough to check to see if the format had function pointers that were not null. Thus, if a format only had 3 hash functions, then the john loader (and cracker) would only use up to 3 hash table sizes. It now sounds like some of them I 'left out' need to be added (they ALL should be added), The faster the hash, the more important it is to check zero or just a very small few candidates. If the format is very slow, then checking more will not add up to a large percentage of runtime. However, for saltless single crypt hashes (like rawmd5, rawsha, single-salted, lm, ntlm, etc), checking as few candidates as possible speeds things up a lot. Now, Solar has posted that he plans on upping this from 5 to 7 hash levels. That may speed stuff up with we get REALLY large data sets. However, there is also memory caching issues that need to be looked at, when we do up the size. Hash tables are notoriously bad on expanding the memory 'working set', and causing page faults, due to a large number of memory pages which are only accessed occasionally. Going up to 256M hash table entries may cause certain systems to really start to memory thrash. However, I am sure he will look into these type issues when implementing this (along with setting up some logic in the --save-memory type settings) I am sure if I do not have this logic right, Solar will correct my mistakes. But even it I do not have everything listed exactly right, it is close enough that it should give you the insight as to why it make such a huge impact. Jim.
Powered by blists - more mailing lists
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.