Date: Fri, 12 Mar 2010 01:35:18 +0100
From: "Magnum, P.I." <rawsmooth@...dband.net>
To: john-users@...ts.openwall.com
Subject: Re: Is JTR MPIrun can be optimized for more cores ?

RB wrote:
> 2010/3/10 Solar Designer <solar@...nwall.com>:
>>> Please be aware that the MPI patch by itself induces (as I recall)
>>> 10-15% overhead in a single-core run.
>> Huh?  For "incremental" mode, it should have no measurable overhead.
> 
> This is what my experimentation showed.  Whether it's the MPI
> initialization or something else, the difference between patched and
> non-patched was statistically significant on my Phenom x4.  I'll
> repeat the tests to get more precise numbers, but it's why I made sure
> it was optional.

I did some testing. First, an --inc=digits run against 1634 DES hashes with 
the same salt, until completion. 129 of the hashes were cracked.

- john-jumbo2:          1m27.074s
- mpijohn on one core:  1m27.045s
- mpijohn on two cores: 1m7.025s

The lousy figure when using two cores is because one of the cores 
completed its run after just 22 seconds! Not optimal. I tried some 
longer jobs running alpha, but the problem remains: one job completes 
long before the other. In real-world usage, incremental mode is not 
supposed to run to completion, so this won't be that much of a problem. 
On the other hand, the problem will be much larger when using many more 
cores.

Anyway, the tests show that MPI-john has no overhead in itself, just as 
I expected. Running on one core, it performs just like vanilla john. It's 
*only* a matter of how we split the jobs. So I did another test, using 
my MPI patches that auto-split the Markov range. In this mode the 
workload is always evenly split, and this mode is *supposed* to run to 
completion.
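
Just to illustrate the idea, splitting a Markov range across nodes is 
basically a matter of dividing an index range [start, end) by the number 
of MPI processes. Here is a minimal sketch (not the actual patch code; 
the range bounds and variable names are made up):

/* Hypothetical sketch: split a Markov candidate index range [start, end)
 * evenly across MPI ranks. The first (total % size) ranks get one extra
 * index so the whole range is covered. Not code from the MPI patch. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    unsigned long long start = 0, end = 1000000000ULL; /* assumed full range */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    unsigned long long total = end - start;
    unsigned long long chunk = total / size;
    unsigned long long rem   = total % size;

    /* ranks below 'rem' take one extra index each */
    unsigned long long my_start = start + rank * chunk +
        (rank < (int)rem ? rank : rem);
    unsigned long long my_end   = my_start + chunk + (rank < (int)rem ? 1 : 0);

    printf("rank %d of %d: indices [%llu, %llu)\n",
        rank, size, my_start, my_end);

    MPI_Finalize();
    return 0;
}

Since every node gets (almost) the same number of candidate indices, the 
nodes finish at roughly the same time, which matches the figures below.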

I ran -markov:250 against 15 LM hashes until completion. 10 were cracked:

- john-jumbo2:                   10m0.313s
- mpijohn on one core:           10m1.249s
- john-jumbo2 manually parallel: 5m13.690s
- mpijohn on two cores:          5m14.277s

This is less than 0.2% overhead. In fact, all MPI overhead occurs before 
and after the actual cracking run, so it is a fixed cost (well under a 
second here). For a 100 times longer run, the overhead *should* be 1/100 
of the one seen here, and thus completely insignificant.

The tests were run on Linux/amd64, using MPICH2, on some Intel laptop 
Core 2 thingy with the CPU speed pegged to 1.6 GHz (due to some 
problems with heat ^^ )

FWIW: in wordlist and single modes, john-fullmpi will currently leapfrog 
rules if any are used, and otherwise leapfrog words. I haven't yet tested 
whether the latter would be better all the time. If so, and when loading 
the wordlist into memory as introduced by the jumbo patch, an MPI job 
should ideally load only its own share of words. It's not very sensible 
to load the full 134 MB "rockyou" wordlist into memory in 32 copies on a 
32-core host.

cheers
magnum
