Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date: Mon, 28 Jun 2010 16:16:47 +0400
From: Solar Designer <>
Subject: Re: bitslice DES parallelization with OpenMP


You can follow Openwall on Twitter now - announcements only, very low
volume - - and please retweet those whenever
you can. :-)  There are also several RSS feeds to choose from specified
on the John the Ripper homepage - - a feed
from Twitter and several from Gmane.

Now to the actual stuff:

I've identified an oversight in the 1.7.6-omp-des-2 patch - all threads
were using the "ones" variable from thread slot 0, which was too close
to that slot's written-to fields (same page).  I could correct this to
use per-thread "ones", but doing so was tricky in the existing code,
where the "tp" pointer is local to the "crypt body" functions, whereas
the "ones" variable is needed by S-box functions (in practice they're
inlined into the "crypt bodies", but this does not help with variable
scope at the source code level).  So I've hacked around this differently
in 1.7.6-omp-des-3 - by keeping this variable separate and at a distance
from the thread slots.  Hopefully, the cache associativity is
sufficient, and it's at most 1 cache line to bounce anyway.  The new
patch is uploaded to the wiki:

and I've also attached a diff against -omp-des-2 to this message.

Another change is setting "Idle = N" instead of the old default of
"Idle = Y".  This makes John less sensitive to other server load, but
unfortunately it also makes John less friendly to those other tasks.

The speed has improved - see below.

On Mon, Jun 28, 2010 at 02:44:02AM +0400, Solar Designer wrote:
> Next test system, Core i7 920 2.67 GHz (quad-core, 2 logical CPUs per
> core) with a little bit of other load (perhaps under 5%).  Normally, it
> does around 2600K c/s for multi-salt with one non-parallelized process,
> 11500K c/s combined with 8 separate processes.
> Benchmarking: Traditional DES [128/128 BS SSE2-16]... DONE
> Many salts:     9802K c/s real, 1237K c/s virtual
> Only one salt:  4683K c/s real, 585472 c/s virtual

Same system, new benchmark:

Benchmarking: Traditional DES [128/128 BS SSE2-16]... DONE
Many salts:     10025K c/s real, 1251K c/s virtual
Only one salt:  4786K c/s real, 597525 c/s virtual

This is due to the "ones" change only.  (The "Idle" setting has no
effect on "--test" benchmarks.  It only comes into play during actual
John runs.)

> guesses: 14  time: 0:00:00:34  c/s: 9491K  trying: maiemaie - mbvnmbvn

We're getting better actual speed now as well:

host!solar:~/john/john-1.7.6-omp-des-3/run$ ./john -e=double --salts=-2 ~/john/pw-fake-unix
Loaded 1458 password hashes with 1458 different salts (Traditional DES [128/128 BS SSE2-16])
mimi             (u3044-des)
aaaa             (u1638-des)
xxxx             (u845-des)
aaaaaa           (u156-des)
bebe             (u1731-des)
gigi             (u2082-des)
jojo             (u3027-des)
lulu             (u3034-des)
booboo           (u171-des)
cloclo           (u2989-des)
cccccc           (u982-des)
guesses: 11  time: 0:00:00:00  c/s: 9183K  trying: iciici - jprjpr
jamjam           (u2207-des)
simsim           (u2671-des)
ssssss           (u3087-des)
guesses: 14  time: 0:00:00:03  c/s: 9777K  trying: amkaamka - anxjanxj
guesses: 14  time: 0:00:00:08  c/s: 9884K  trying: clxcclxc - cnklcnkl
guesses: 14  time: 0:00:00:24  c/s: 9930K  trying: irzgirzg - itmpitmp
guesses: 14  time: 0:00:00:38  c/s: 9924K  trying: oejkoejk - ofwtofwt
woofwoof         (u1435-des)
guesses: 15  time: 0:00:01:09  c/s: 9838K  trying: zzwmzzwm - zzzzzzzz

> Finally, a faster system - dual Xeon X5460 (8 cores total) at 3.16 GHz,
> but with a bit more load (around 10%).  The benchmarks vary a lot:

The benchmarks still vary a lot (the load is changing), so I'll only
provide a test run output (under load).  Previously, I got:

> guesses: 15  time: 0:00:01:23  c/s: 8203K  trying: zzwmzzwm - zzzzzzzz

This time, things are a lot better:

host!solar:~/john$ ./john-omp-des-3 -e=double --salts=-2 pw-fake-unix
Loaded 1458 password hashes with 1458 different salts (Traditional DES [128/128 BS SSE2-16])
mimi             (u3044-des)
aaaa             (u1638-des)
xxxx             (u845-des)
aaaaaa           (u156-des)
bebe             (u1731-des)
gigi             (u2082-des)
jojo             (u3027-des)
lulu             (u3034-des)
booboo           (u171-des)
cloclo           (u2989-des)
cccccc           (u982-des)
jamjam           (u2207-des)
guesses: 12  time: 0:00:00:00  c/s: 16384K  trying: odwodw - prfprf
simsim           (u2671-des)
ssssss           (u3087-des)
guesses: 14  time: 0:00:00:03  c/s: 15406K  trying: bjcubjcu - bkqdbkqd
guesses: 14  time: 0:00:00:07  c/s: 14431K  trying: dlqqdlqq - dndzdndz
guesses: 14  time: 0:00:00:12  c/s: 14441K  trying: ghwmghwm - gjjvgjjv
guesses: 14  time: 0:00:00:17  c/s: 14526K  trying: jfpsjfps - jhdbjhdb
guesses: 14  time: 0:00:00:25  c/s: 14642K  trying: nyhwnyhw - nzvfnzvf
woofwoof         (u1435-des)
guesses: 15  time: 0:00:00:46  c/s: 14722K  trying: zzwmzzwm - zzzzzzzz

In this case, I think the speedup is mostly due to "Idle = N", which
makes context switches less frequent (when there's other demand for CPU
time) and thereby keeps the threads almost in sync more often.

Enjoy, and don't forget to provide your feedback.


P.S. Setting DES_BS_EXPAND_MERGED to 1 provides some speedup for the
single-salt case (as expected, because the "expand" step gets
parallelized then), but it slows things down a bit for the multi-salt
case (at least on the Core i7 system).  The latter is unexpected - if
anyone can figure out the cause of the slowdown (or show that it
doesn't occur on other builds/systems maybe?), please let me know.

View attachment "john-1.7.6-omp-des-2to3.diff" of type "text/plain" (2344 bytes)

Powered by blists - more mailing lists

Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.