john-users - Re: bitslice DES parallelization with OpenMP

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20100630004226.GA12393@openwall.com>
Date: Wed, 30 Jun 2010 04:42:26 +0400
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Subject: Re: bitslice DES parallelization with OpenMP

On Mon, Jun 28, 2010 at 02:44:02AM +0400, Solar Designer wrote:
> (The OpenMP parallelization code for slower hashes that is integrated
> into JtR 1.7.6 is a lot less sensitive to system load, likely because
> the synchronization is less frequent.  Maybe the same could be achieved
> by buffering a lot more candidate passwords with these faster hashes -
> beyond the current setting of 1024 - although the candidates would not
> fit in L1 data cache then, likely resulting in a minor slowdown when
> there's no other load.)

I've just confirmed the above guess.  Changing DES_bs_mt from 8 to 96,
I am getting a 1% to 2% slowdown on an otherwise idle system, but
roughly a 10% speedup on a system under significant non-John load.

Here's a new benchmark for the same 8-core Xeon system that was doing up
to 17.6M c/s before:

Benchmarking: Traditional DES [128/128 BS SSE2-16]... DONE
Many salts:     19255K c/s real, 2403K c/s virtual
Only one salt:  5640K c/s real, 704143 c/s virtual

Benchmarking: BSDI DES (x725) [128/128 BS SSE2-16]... DONE
Many salts:     669148 c/s real, 83643 c/s virtual
Only one salt:  530070 c/s real, 67080 c/s virtual

The server load has stayed roughly the same - I can still reproduce the
under-18M c/s best speed with the previous build.

Here's a new test run:

host!solar:~/john$ ./john-omp-des-3-96 -e=double --salts=-2 pw-fake-unix
Loaded 1458 password hashes with 1458 different salts (Traditional DES [128/128 BS SSE2-16])
cloclo           (u2989-des)
mimi             (u3044-des)
aaaa             (u1638-des)
xxxx             (u845-des)
aaaaaa           (u156-des)
jamjam           (u2207-des)
booboo           (u171-des)
bebe             (u1731-des)
gigi             (u2082-des)
cccccc           (u982-des)
jojo             (u3027-des)
lulu             (u3034-des)
simsim           (u2671-des)
ssssss           (u3087-des)
guesses: 14  time: 0:00:00:01  c/s: 18158K  trying: ajjgajjg - bbnvbbnv
guesses: 14  time: 0:00:00:06  c/s: 17381K  trying: debsdebs - dwghdwgh
guesses: 14  time: 0:00:00:12  c/s: 18145K  trying: ibiaibia - itmpitmp
guesses: 14  time: 0:00:00:18  c/s: 17529K  trying: lofclofc - mgjrmgjr
guesses: 14  time: 0:00:00:27  c/s: 17059K  trying: rdqardqa - rvuprvup
guesses: 14  time: 0:00:00:34  c/s: 17017K  trying: wawiwawi - wtaxwtax
woofwoof         (u1435-des)
guesses: 15  time: 0:00:00:36  c/s: 16934K  trying: xlfoxlfo - ydkdydkd
guesses: 15  time: 0:00:00:40  c/s: 16972K  trying: zntkzntk - zzzzzzzz

In john.log, I am getting:

0:00:00:00 - Candidate passwords will be buffered and tried in chunks of 12288

Previously, with chunks of "only" 1024 candidate passwords, this would
complete in no less than 46 seconds:

guesses: 15  time: 0:00:00:46  c/s: 14722K  trying: zzwmzzwm - zzzzzzzz

1024 is the minimum for an 8-core system with SSE2 - all of those
candidate passwords are actually "in flight" on the CPUs at the same
time (8 cores computing 128 hashes each - in the 128 different bit
positions).  When the number of thread slots is increased beyond 8,
there are still only 1024 of different candidates being processed by the
CPUs at the same time (with 8 threads), but they don't always have to be
from the same consecutive block of 1024 buffered ones - they can be
scattered (to some extent) over the larger chunk.  This lets the threads
stay out of sync for a longer while (if they get out of sync for
whatever reason), only joining them at a later time (less frequently).
By that time, chances are that the relative time needed for the threads
to get back in sync (as a fraction of the total time the threads ran)
will be less.

Also, the larger chunks may make it reasonable to use
OMP_WAIT_POLICY=PASSIVE, which slows things down (at least on non-SMT
CPUs) yet makes John a lot more friendly to other tasks.  With smaller
chunks - and more frequent switches between sequential and parallel
processing - it was way too expensive to go for passive waits (which
involve syscalls).

Alexander
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.