Date: Thu, 17 Jun 2010 03:14:52 +0400
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Subject: Re: 1.7.6-jumbo-2

On Tue, Jun 15, 2010 at 09:48:24AM -0400, Erik Winkler wrote:
> I noticed the following difference between the patched and unpatched
> 1.7.6 benchmark for LM hashes on OS X.  What does the patch change to
> cause such a significant difference?  The Traditional DES calculation
> is not affected, so I am not sure of the cause.

The patch does not change anything directly relevant.  In general, the
LM hash code is more susceptible (than code for other hash types) to
"spurious" performance changes resulting from program memory layout
changes, but those changes are usually within 10%.  What you have
observed is a bit extreme.  I am not able to trigger it; for me,
performance is roughly the same with and without the jumbo patch.

> Unpatched:
>
> ../run/john -test=10 -format:LM
> Benchmarking: LM DES [128/128 BS SSE2-16]... DONE
> Raw:	13021K c/s real, 13034K c/s virtual
>
> After applying jumbo patch:
>
> ../run/john -test=10 -format:LM
> Benchmarking: LM DES [128/128 BS SSE2-16]... DONE
> Raw:	10317K c/s real, 10338K c/s virtual

I think this has nothing to do with the jumbo patch itself, nor with
compiler options.  The patch simply adds more code and data (elsewhere),
which changes the addresses of some pieces of code and data used by the
LM hash code.  As a result, addresses of a pair of related pieces of
code or data may get overlapping cache tags, which makes it hard for
them to stay in the cache at the same time.

Yet I am surprised by the extreme performance difference seen above.  It
would be customary to see something like this on an Alpha with a
direct-mapped L1 data cache, but x86 CPUs have at least 2-way set
associative L1 caches - and usually 4-way or better.  Well, if your
specific CPU is only 2-way for the L1 instruction and/or data cache,
that might explain the unstable performance.
I recall that it was tricky to obtain stable performance on the
original Intel Pentium (in the 1990s), which had 2-way L1 caches (8 KB
as two 4 KB ways).

JtR tries to mitigate issues like this to some extent by combining
related pieces of frequently-accessed data into structures (e.g.,
DES_bs_all for the bitslice DES implementation).  Without that, things
could be a lot worse.  Yet it is possible, say, that a specific
sensitive/relevant portion of DES_bs_all "overlaps" with the stack,
or/and that DES_bs_set_key_LM() "overlaps" with DES_bs_crypt_LM(), as
these come from different source files (and thus different object
files, the relative placement of which may vary from build to build).
A 2-way set associative cache would be enough to deal with these
specific examples, but if a third component happens to "overlap" with
the two, then we have the problem again.

We'd need to study specific addresses or/and use the CPU's performance
counters to confirm or disprove this guess as it relates to the
slowdown seen on specific builds.  An easier approach may be to move a
pre-built binary that exhibits the problem to another machine - with a
CPU with better caches - and see whether the problem goes away or
stays.

Alexander
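P.S.  As a rough, hypothetical way to check the address-overlap guess:
take the addresses of the suspect symbols from a symbol listing and
compute their set indices.  The addresses below are made up; on a real
build they would come from something like
`nm ../run/john | grep -E 'DES_bs_(all|set_key_LM|crypt_LM)'`.  The
cache geometry (64-byte lines, 256 sets, i.e. a 16 KB way) is also an
assumption for illustration:

```shell
# Made-up symbol addresses; all three differ by a multiple of the
# assumed 16 KB way size, so they land in the same set.
for addr in 0x08049a40 0x0804da40 0x08051a40; do
  printf '%s -> set %d\n' "$addr" $(( addr / 64 % 256 ))
done
```

If the hot symbols (and the stack) all report the same set index on
the slow build but not on the fast one, that would support the cache
conflict explanation.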