john-users - Re: Best performance MPI vs OMP

Message-ID: <20120301213735.GB2793@openwall.com>
Date: Fri, 2 Mar 2012 01:37:35 +0400
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Subject: Re: Best performance MPI vs OMP

On Thu, Mar 01, 2012 at 06:47:32PM +0100, Javier González del Tánago Liberal wrote:
> I've been trying John and I noticed a big difference in performance 
> between MPI and OMP (LM, NTLM overall). These are the results:

Yes, "fast" hashes show poor OpenMP scaling.  However, the c/s rate is
not the only thing to consider - also relevant are ease of use and the
order in which the candidate passwords are tried.  This is where OpenMP works
better, although on a 48-way machine the performance hit in terms of
c/s rate for "fast" hashes is just too large, so MPI is currently a
better option for those.

> - DES
>     OMP
>         Benchmarking: Traditional DES [128/128 BS SSE2-16]... (48xOMP) DONE
>         Many salts:    47087K c/s real, 982834 c/s virtual
>         Only one salt:    21921K c/s real, 457370 c/s virtual
>     MPI
>         Benchmarking: Traditional DES [128/128 BS SSE2-16]... (48xMPI) DONE
>         Many salts:    50263K c/s real, 50263K c/s virtual
>         Only one salt:    48422K c/s real, 48422K c/s virtual

As you can see, OpenMP shows a 93.7% efficiency for "many salts" here,
which is actually surprisingly good (it's usually at around 90% even for
much smaller systems).
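
For reference, the 93.7% figure appears to be the OpenMP "many salts"
rate taken as a fraction of the MPI one from the quoted benchmark.  A
quick check in Python (the function name is only for illustration, and
the inputs are the "real" c/s values above, in thousands):

```python
def relative_efficiency(omp_cs, mpi_cs):
    """Return the OpenMP c/s rate as a percentage of the MPI c/s rate."""
    return 100.0 * omp_cs / mpi_cs

# Traditional DES, "real" c/s in thousands, copied from the quoted benchmark
many_salts = relative_efficiency(47087, 50263)
one_salt = relative_efficiency(21921, 48422)

print(f"many salts: {many_salts:.1f}%")
print(f"one salt:   {one_salt:.1f}%")
```

The same ratio for "only one salt" comes out much lower, which is in
line with the faster code paths scaling worse under OpenMP.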

OpenMP will only show better efficiency for slower hashes - such as for
Blowfish-based and MD5-based crypt(3), and for MSCash2.

> - LM
>     OMP
>         Benchmarking: LM DES [128/128 BS SSE2-16]... (48xOMP) DONE
>         Raw:    39714K c/s real, 828600 c/s virtual
>     MPI
>         Benchmarking: LM DES [128/128 BS SSE2-16]... (48xMPI) DONE
>         Raw:    684363K c/s real, 677587K c/s virtual

Yes, LM barely scales with the current OpenMP code.  It will
scale a bit for thread counts in the range of 2 to 8 or so, but going
further will just slow it down.  You may try OMP_NUM_THREADS=2 and
increase it slowly if you're curious what the optimal value and max
performance for LM with OpenMP are.  Another setting to experiment with
is GOMP_SPINCOUNT (try 10000, 100000, 1000000, 10000000).
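
A minimal sketch of such a sweep in Python follows.  The thread counts
chosen, the ./john path, and the --test/--format=LM arguments are
assumptions about your setup, and the helper names are made up for this
example; only the two environment variable names and the spin count
values come from the text above.

```python
import itertools
import os
import subprocess

THREAD_COUNTS = list(range(2, 9))                  # 2..8, increased slowly
SPIN_COUNTS = [10000, 100000, 1000000, 10000000]   # values suggested above

def sweep_settings(thread_counts=THREAD_COUNTS, spin_counts=SPIN_COUNTS):
    """Return one dict of environment overrides per (threads, spin) pair."""
    return [
        {"OMP_NUM_THREADS": str(t), "GOMP_SPINCOUNT": str(s)}
        for t, s in itertools.product(thread_counts, spin_counts)
    ]

def run_benchmark(overrides, john="./john", fmt="LM"):
    """Run one benchmark with the overrides applied and return its output.

    The binary path and command-line arguments are assumptions; adjust
    them to match your build.
    """
    env = dict(os.environ, **overrides)
    result = subprocess.run(
        [john, "--test", f"--format={fmt}"],
        env=env, capture_output=True, text=True, check=False,
    )
    return result.stdout
```

To use it, loop over sweep_settings(), call run_benchmark() on each
entry, and compare the reported "real" c/s rates by hand.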

> - NETHALFLM
>     OMP
>         Benchmarking: HalfLM C/R DES [nethalflm]... (48xOMP) DONE
>         Many salts:    29949K c/s real, 647146 c/s virtual
>         Only one salt:    1622K c/s real, 262564 c/s virtual
>     MPI
>         Benchmarking: HalfLM C/R DES [nethalflm]... (48xMPI) DONE
>         Many salts:    52215K c/s real, 52215K c/s virtual
>         Only one salt:    26010K c/s real, 26010K c/s virtual
> 
> - NETLM
>     OMP
>         Benchmarking: LM C/R DES [netlm]... (48xOMP) DONE
>         Many salts:    28550K c/s real, 647123 c/s virtual
>         Only one salt:    857480 c/s real, 231109 c/s virtual
>     MPI
>         Benchmarking: LM C/R DES [netlm]... (48xMPI) DONE
>         Many salts:    52331K c/s real, 51813K c/s virtual
>         Only one salt:    17337K c/s real, 17337K c/s virtual

These are reasonable numbers.

> Is that normal?

Yes.

> I suppose that in the same machine, the OMP implementation should work
> faster, isn't?

No, it should not.  Why would it?

OpenMP means close coordination between the threads, which involves
overhead (one thread may sometimes wait for another, data may need to be
transferred between the different CPUs' caches), not to mention that
MT-safe code is often slower on its own (because of higher register
pressure and more complicated addressing modes).  BTW, the latter means
that you might be able to get better MPI performance for some of the
hash types by building without OpenMP.  From your benchmarks above, it
is unclear whether your MPI ones are for an MPI-only or an MPI+OpenMP
build.

With MPI, there are separate processes, which are not synchronized to
each other.  So the order in which candidate passwords are tried is less
optimal (it does not reflect decreasing estimated probabilities as
closely), but the c/s rate is higher (no waiting, no extra data
transfers, no extra register pressure, no complicated addressing modes).
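
The waiting part of this can be illustrated with a toy model in Python.
This is only a sketch under simplifying assumptions, not how John
schedules work: the timings are made up, and cache transfers and
register pressure are not modeled at all.  It compares workers that meet
at a barrier after every round with workers that never wait for each
other.

```python
def synchronized_time(batch_times):
    """Total time when all workers wait at a barrier after every round.

    batch_times[r][w] is the time worker w needs for round r, so each
    round costs as much as its slowest worker.
    """
    return sum(max(round_times) for round_times in batch_times)

def independent_time(batch_times):
    """Total time when each worker runs all its rounds without waiting."""
    per_worker = zip(*batch_times)  # regroup the timings by worker
    return max(sum(times) for times in per_worker)

# Made-up timings for 3 workers over 4 rounds (arbitrary units)
timings = [
    [1.0, 1.2, 1.0],
    [1.3, 1.0, 1.0],
    [1.0, 1.0, 1.4],
    [1.1, 1.0, 1.0],
]

print("synchronized:", synchronized_time(timings))
print("independent: ", independent_time(timings))
```

In this model the independent total can never exceed the synchronized
one, since a sum of per-round maxima is at least the largest per-worker
sum.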

So far, the only exception I am aware of - where OpenMP is actually
faster in terms of c/s rate by a few percent - is the Blowfish-based
crypt(3) code on UltraSPARC T2.  My guess is that this is due to sharing
of code and mostly read-only data between the threads, which helps use
the CPU's L1 caches more efficiently.

> 1.7.9-jumbo-5_mpi+omp [linux-x86-64]

So that's it - you need an MPI-only build for even better performance.
Also, instead of linux-x86-64 you may try linux-x86-64i (with the "i")
for much better performance with MD5-based hashes.

I would very much appreciate it if you submitted your OpenMP and single CPU
core benchmarks to the wiki:

http://openwall.info/wiki/john/benchmarks

Please also post the corresponding MPI benchmarks here, or maybe add
a third table to the wiki.

The benchmark results you posted so far are probably those relevant to
your intended use, which is just right, but for the wiki we need info on
specific hash types.

Thanks,

Alexander