john-users - Re: 1.7.6-jumbo-2

Message-ID: <20100616231452.GA15798@openwall.com>
Date: Thu, 17 Jun 2010 03:14:52 +0400
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Subject: Re: 1.7.6-jumbo-2

On Tue, Jun 15, 2010 at 09:48:24AM -0400, Erik Winkler wrote:
> I noticed the following difference between the patched and unpatched 1.7.6 benchmark for LM hashes on OS X.  What does the patch change to cause such a significant difference?  The Traditional DES calculation is not affected, so I am not sure of the cause.

The patch does not change anything directly relevant.

In general, the LM hash code is more susceptible (than code for other
hash types) to "spurious" performance changes resulting from program
memory layout changes, but those changes are usually within 10%.  What
you have observed is a bit extreme.  I am not able to trigger this.  For
me, performance is roughly the same with and without the jumbo patch.

> Unpatched:
> 
> ../run/john -test=10 -format:LM
> Benchmarking: LM DES [128/128 BS SSE2-16]... DONE
> Raw:	13021K c/s real, 13034K c/s virtual
> 
> After applying jumbo patch:
> 
> ../run/john -test=10 -format:LM
> Benchmarking: LM DES [128/128 BS SSE2-16]... DONE
> Raw:	10317K c/s real, 10338K c/s virtual

I think that this has nothing to do with the jumbo patch itself, nor
with compiler options.  The patch simply adds more code and data
(elsewhere), which changes the addresses of some pieces of code and data
used by the LM hash code.  As a result, a pair of related pieces of code
or data may end up mapping to the same cache set, which will make
them "unfriendly" to each other staying in the cache at the same time.
Yet I am surprised by the extreme performance difference seen above.
It would be customary to see something like this on an Alpha with
direct-mapped L1 data cache, but x86 CPUs have at least 2-way set
associative L1 caches - and usually 4-way or better.  Well, maybe if
your specific CPU is only 2-way for the L1 instruction and/or data
cache, then this might explain the unstable performance.  I recall that
it was tricky to obtain stable performance on the original Intel
Pentium (in the 1990s), which had 2-way L1 caches (8 KB as two 4 KB ways).
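
To make the conflict idea concrete, here is a minimal Python sketch of how a
set-associative cache selects a set from an address.  The geometry (8 KB,
2-way, 32-byte lines) is an assumption for illustration, roughly matching
the original Pentium's L1 caches; the addresses are made up, and the
virtual vs. physical indexing distinction is ignored.

```python
def cache_set_index(addr, cache_size=8 * 1024, ways=2, line_size=32):
    """Return the index of the cache set that addr maps to."""
    num_sets = cache_size // (ways * line_size)   # 128 sets for the defaults
    return (addr // line_size) % num_sets

# Hypothetical addresses, chosen only to illustrate the effect.
a = 0x0804A040
b = a + 4096   # exactly one way size (4 KB) further: same set as a
c = a + 64     # two cache lines further: a different set

conflict = cache_set_index(a) == cache_set_index(b)
no_conflict = cache_set_index(a) != cache_set_index(c)
```

Under these assumptions, any two addresses that differ by a multiple of
4 KB compete for the same two cache lines.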

JtR tries to mitigate issues like this to some extent by combining
related pieces of frequently-accessed data into structures (e.g.,
DES_bs_all for the bitslice DES implementation).  Without that, things
could be a lot worse.  Yet it is possible, say, that a specific
sensitive/relevant portion of DES_bs_all "overlaps" with the stack,
and/or that DES_bs_set_key_LM() "overlaps" with DES_bs_crypt_LM() as
these come from different source files (and thus different object files,
the relative placement of which may vary from build to build).  A 2-way
set associative cache would be enough to deal with these specific
examples, but if a third component happens to "overlap" with the two,
then we have the problem again.
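
The two-versus-three component effect can be sketched with a toy model of a
single cache set.  This assumes strict LRU replacement (real CPUs may use
an approximation), and the labels below are merely hypothetical names for
blocks that all map to the same set.

```python
from collections import OrderedDict

def count_misses(accesses, ways):
    """Count misses for one cache set with LRU replacement."""
    lines = OrderedDict()   # blocks currently held, least recently used first
    misses = 0
    for block in accesses:
        if block in lines:
            lines.move_to_end(block)        # hit: mark as most recently used
        else:
            misses += 1
            if len(lines) >= ways:
                lines.popitem(last=False)   # evict the least recently used
            lines[block] = True
    return misses

# Hypothetical access patterns: each block is touched in turn, 1000 rounds.
two = ["set_key", "crypt"] * 1000
three = ["set_key", "crypt", "stack"] * 1000

misses_two = count_misses(two, ways=2)      # only the initial cold misses
misses_three = count_misses(three, ways=2)  # every single access misses
```

In this model, two blocks coexist in a 2-way set, whereas three blocks
accessed in rotation evict each other on every access.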

We'd need to study specific addresses and/or use the CPU's performance
counters to confirm or disprove this guess as it relates to the slowdown
seen on specific builds.  An easier approach may be to move a pre-built
binary that exhibits the problem to another machine - with a CPU with
better caches - and see if the problem goes away or stays.
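
As a rough sketch of what studying specific addresses might look like, the
Python snippet below groups symbols by the cache set their start address
maps to and reports sets claimed by more symbols than the cache has ways.
The addresses are invented for illustration (real ones could be taken from
the binary's symbol table, e.g. with nm), the cache geometry is an
assumption, and only the first cache line of each symbol is considered.

```python
from collections import defaultdict

def crowded_sets(symbols, cache_size, ways, line_size):
    """Return {set_index: [names]} for sets with more symbols than ways."""
    num_sets = cache_size // (ways * line_size)
    by_set = defaultdict(list)
    for name, addr in symbols.items():
        by_set[(addr // line_size) % num_sets].append(name)
    return {s: sorted(n) for s, n in by_set.items() if len(n) > ways}

# Hypothetical start addresses; not taken from any real build.
symbols = {
    "DES_bs_all": 0x0810C000,
    "DES_bs_set_key_LM": 0x08052000,
    "DES_bs_crypt_LM": 0x08049000,
    "unrelated_table": 0x0810C400,
}

# Assumed geometry: 8 KB, 2-way, 32-byte lines.
suspects = crowded_sets(symbols, cache_size=8 * 1024, ways=2, line_size=32)
```

A non-empty result would only be a hint, to be confirmed with performance
counters or by testing on a CPU with better caches.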

Alexander