john-dev - Re: [GSoC] John the Ripper support for PHC finalists

Message-ID: <20150425113905.GA19072@openwall.com>
Date: Sat, 25 Apr 2015 14:39:05 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: [GSoC] John the Ripper support for PHC finalists

Hi Agnieszka,

On Sat, Apr 25, 2015 at 04:27:49AM +0200, Agnieszka Bielec wrote:
> I'm sending more benchmarking tests.

These are very nice.  Could you format this spreadsheet so that we can
export it into a reasonable-looking PDF?  Please also add descriptions
of our systems (what actual hardware dev=1 corresponds to, etc.) and the
actual memory sizes, in kilobytes, corresponding to the different m_cost
settings.  Then we'll "deliver" this as a report to the PHC community.

> Sorry, in this week I had a lot of work activities related to my university

No problem.

> Isn't it strange that sse2 is faster than avx2 for greater costs values??

Where are you seeing that?  Are you perhaps looking at "sse2" on super
vs. "avx2" on well?  If so, no, it's not strange.  super is a much
faster machine than well: 16 cores (32 threads) vs. 4 cores (8 threads).
Also, super has 8 memory channels, and well has 2.  (This matters for
high m_cost, when we're out of cache.)  super has a total of 40 MB of L3
cache (2.5 MB per core), well has 8 MB (2 MB per core).  well's higher
CPU clock rate and AVX2 can't fully compensate for super's many
advantages.

What could seem a bit strange is the opposite: that super is slower than
well at any cost settings at all.  But there's an explanation for this:
at high c/s rates, overhead plays more of a role, especially with OpenMP.
Indeed, when doing a million hashes per second, even slight
desynchronization between the threads results in some of the threads
waiting for others at the end of an OpenMP parallel block.  (FWIW,
higher OMP_SCALE helps reduce this effect by letting OpenMP (re)allocate
work dynamically.  With low OMP_SCALE, there's simply too little work
for that.)  At lower c/s rates, this effect is less pronounced because
slight discrepancies in threads' performance correspond to a much
smaller fraction of their total running time.
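
To make the scaling argument above concrete, here is a minimal sketch of
a toy model.  It is only an illustration under stated assumptions, not a
measurement: it assumes that the slowest thread reaches the end of a
parallel block a fixed number of microseconds after the others, and the
function name, parameters, and sample numbers are all hypothetical.

```c
/*
 * Toy model (an assumption, not measured behavior): fraction of a
 * parallel block's wall time that the faster threads spend waiting,
 * given that the slowest thread finishes skew_us microseconds late.
 */
double wait_fraction(double hashes_per_block, double hashes_per_sec,
    double skew_us)
{
	/* Time one block takes when the threads are perfectly in sync */
	double work_us = hashes_per_block / hashes_per_sec * 1e6;

	return skew_us / (work_us + skew_us);
}
```

With these made-up inputs, wait_fraction(1024, 1e6, 100) is close to 9%,
whereas wait_fraction(1024, 1e3, 100) is below 0.01%: the same absolute
skew matters far less when each block takes longer to compute.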

If you benchmark with --fork rather than with OpenMP, you'll likely see
super performing better than well at all cost settings without exception.
You'd use --fork=8 on well and --fork=32 on super, and add up the
individual processes' speeds.  Yes, this is inconvenient if you need to
run many such benchmarks, so I don't actually suggest it.  I just point
out that the OpenMP overhead is avoidable.

Also, maybe you didn't use "export GOMP_CPU_AFFINITY=0-31" in some (or
all?) of your tests on super.  It usually needs that setting, especially
at high c/s rates.

BTW, surely your "sse2" is actually AVX.  You used the "SSE2" version of
the source code, but when those same intrinsics are compiled with AVX
enabled, the compiler produces the corresponding AVX instructions for
them.  We should probably document these benchmarks as "AVX" and "AVX2"
when bringing them to PHC.
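
As a small illustration of this point, the sketch below uses only
SSE2-level intrinsics.  The helper name add4_u32 is hypothetical and is
not taken from any of the PHC code.  With GCC or Clang, for example,
building it with -msse2 typically yields a legacy-encoded paddd, while
building the same source with -mavx yields the VEX-encoded vpaddd on the
same 128-bit registers.  Comparing the output of -S for the two builds
should show this, although the exact code generated depends on the
compiler and its version.

```c
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>	/* SSE2 intrinsics */
#endif

/* Hypothetical helper: out[i] = a[i] + b[i] for four 32-bit lanes */
void add4_u32(const uint32_t *a, const uint32_t *b, uint32_t *out)
{
#if defined(__SSE2__)
	__m128i x = _mm_loadu_si128((const __m128i *)a);
	__m128i y = _mm_loadu_si128((const __m128i *)b);

	/* paddd in an SSE2 build; vpaddd (VEX-encoded) in an AVX build */
	_mm_storeu_si128((__m128i *)out, _mm_add_epi32(x, y));
#else
	/* Portable fallback so that the sketch also builds on non-x86 */
	for (int i = 0; i < 4; i++)
		out[i] = a[i] + b[i];
#endif
}
```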

What's actually puzzling is the sharp decrease in performance with
higher t_cost on GPUs.  A 4x decrease is expected when you increase
t_cost by 2, but e.g. for "private dev=5", m_cost=4 we see a decrease of
more than 100x when going from t_cost=2 to t_cost=4.  My guess is that
your OpenCL kernel does not fit in the GPUs' L1 caches, and that with
shorter running times the different wavefronts/warps reuse each other's
instruction fetches to a greater extent, whereas with longer running
times they gradually become more out of sync.  If so, you can probably
improve performance in those high t_cost cases (and possibly for lower
costs as well) by reducing your kernel's code size.  But that's just a
guess, which may well be wrong.  You may want to check the code size at
the GPU ISA level first.

A major task that you haven't approached yet is instruction interleaving
on the CPUs.  Do you understand this concept?  Including why it helps?
While we use it in JtR in various formats, I think it's better
illustrated by the evolution of my php_mt_seed program:

http://www.openwall.com/php_mt_seed/

You may start with an older version of it, and see how much faster it
became since then and _why_.  You may also try reimplementing some of
those same optimizations on your own, just to practice, without looking
_too_ closely at the already-made optimizations (just skim over them to
get an overall idea of the approach taken).  I think this is a good way
to learn both SIMD programming and interleaving (which is a concept
relevant with and without SIMD).

http://cvsweb.openwall.com/cgi/cvsweb.cgi/projects/php_mt_seed/php_mt_seed.c
http://download.openwall.net/pub/projects/php_mt_seed/

Unfortunately, the oldest version of php_mt_seed already includes 2x
interleaving, but I brought it much further in later versions (to 8x
interleaving and SIMD at once).
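
For readers new to the concept, here is a minimal sketch of 2x
interleaving in plain C.  It is not taken from php_mt_seed or JtR: the
mixing step and all function names are hypothetical and chosen only for
illustration.  Each call to mix() depends on the previous result, so a
single chain leaves a pipelined, superscalar CPU partly idle while it
waits for that result.  Advancing two independent chains in the same
loop body gives the CPU other useful work to issue in the meantime.

```c
#include <stdint.h>

/* Hypothetical mixing step: the output feeds the next call's input */
static inline uint32_t mix(uint32_t x)
{
	x ^= x >> 15;
	x *= 0x2c1b3c6dU;
	x ^= x >> 12;
	return x + 0x9e3779b9U;
}

/* Baseline: one dependency chain at a time */
void chain_1x(const uint32_t *in, uint32_t *out, int n, int rounds)
{
	for (int i = 0; i < n; i++) {
		uint32_t a = in[i];
		for (int r = 0; r < rounds; r++)
			a = mix(a);
		out[i] = a;
	}
}

/* 2x interleaving: two independent chains share one inner loop */
void chain_2x(const uint32_t *in, uint32_t *out, int n, int rounds)
{
	int i;

	for (i = 0; i + 1 < n; i += 2) {
		uint32_t a = in[i], b = in[i + 1];
		for (int r = 0; r < rounds; r++) {
			a = mix(a);	/* does not depend on b */
			b = mix(b);	/* does not depend on a */
		}
		out[i] = a;
		out[i + 1] = b;
	}

	if (i < n) {	/* leftover element when n is odd */
		uint32_t a = in[i];
		for (int r = 0; r < rounds; r++)
			a = mix(a);
		out[i] = a;
	}
}
```

Both functions produce identical results.  Whether and by how much the
interleaved version is faster depends on the CPU, the compiler, and the
latency of the operations involved, so it has to be measured.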

Thanks,

Alexander