john-dev - Re: [GSoC] John the Ripper support for PHC finalists

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20150505170550.GA27589@openwall.com>
Date: Tue, 5 May 2015 20:05:50 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: [GSoC] John the Ripper support for PHC finalists

Agnieszka,

Going forward please mention specific branches and commits in your
messages, without me having to ask, so that I can take a look at your
code sooner.

I've just skimmed over
d3db03708e2a3b30e9ac3954b42082b6ac6e87c3 in the interleaving branch at
https://github.com/Lucife-r/JohnTheRipper

I notice that you use 2x interleaving factor with SSE2, but 4x with
AVX2.  Why this specific choice?  If you were trying to match the SIMD
vector width, then that's flawed logic.  On the contrary, with wider
SIMD vectors a smaller interleaving factor might work better if the
cache sizes stay the same.

I suggest that you try 2x interleaving with AVX2.  There's little point
in going for 4x without having seen speedup with 2x first.

Then, you're unnecessarily relying on compiler optimizations too much.
You've turned random_number, index_local, and index_global from
variables into arrays, and you're hoping(?) that the compiler will
allocate groups of 2 or 4 registers for them anyway.  Well, it might or
it might not.  Please use simple variables, like the original code did.
So random_number0, random_number1, etc.  You're using explicit indices
anyway, so this won't complicate your code.

Please omit or rework index_global_t.  There's no point in precomputing
values like "index_global_t[3]=index_global[3]+3;" when you're then only
using them like "S[index_global_t[3]]".  If S[] elements were no larger
than 8 bytes each, then the CPU's addressing modes would enable e.g.
S+i*8+24 to be calculated during effective address calculation in the
load instruction at no extra cost.  This doesn't work for elements
larger than 8 (and yours are __m256i), so it makes sense to precompute
them multiplied by sizeof(__m256i), and then access the data via a macro
that would do the proper typecasts to use byte offsets.  Not only for
index_global_t, but also for i0 and index_local*, so that the
multiplication by sizeof(__m256i) (shift left by 5) would be performed
less frequently, and then +32, +64, +96, etc. would be added to it.

On Sat, May 02, 2015 at 06:14:05AM +0200, Agnieszka Bielec wrote:
> I made interleaving for no-SIMD, SSE2 and AVX2 version, the speed for
> costs 2,2 and 0,0 is slightly better but for costs 6,6 and 8,8 is
> worse, so I'm not sure if I did everything correctly.

Given POMELO's use of memory, interleaving might in fact be of little
help, as the more memory you use at once, the slower the memory accesses
become as you're getting further out of cache.  I think this is why
you're not seeing a speedup with only your initial implementation, not
optimized yet.  You might or might not see more of a speedup when you
implement optimizations such as what I suggested above.

Also, please try 2x rather than 4x interleaving for AVX2.  With 4x, we
might be increasing our working set size unnecessarily and the register
pressure might be too high.  Remember that on x86_64 we have only 16
SIMD registers plus 16 scalar registers.  If our code needs more, it
starts spilling registers to memory (well, usually to L1 cache indeed).

I suggest that you review the generated assembly code without and with
interleaving.  See if extra instructions get generated (such as spilling
registers to memory and loading them back).

Also, find those left shifts that are used to calculate byte offsets
from indices.  See if any can be avoided or moved to outer loops.
Perhaps some of these optimizations can also be made to non-interleaved
code (and even submitted back to the author of POMELO).

> Maybe it's because we have bigger gaps between chunks of data in memory

No, I think the memory layout is fine.  When different cache lines are
accessed, it does not matter how large or small the gap between their
currently cached addresses is.

However, I suggest that you align the memory allocations to be on cache
line boundary.  Right now, you align them to 32 bytes as AVX2 requires,
but our cache lines are 64 bytes.  Crossing a cache line boundary
unnecessarily has performance cost and it thrashes other valuable data
out of cache (it thrashes two cache lines instead of just one).

Oh, and in the SSE2/AVX code you're not aligning the memory allocation
of S at all, so you only get the current malloc()'s guaranteed 16-byte
alignment.  This might or might not happen to also be 64-byte aligned.
You should explicitly make it at least 64-byte aligned.

Preferably, align S to page boundary (4096 bytes).  And do this for
non-interleaved implementations too, for fair benchmarks against them.

> I uploaded my code on the branch "interleaving"

Oh, I missed this at first. :-(

> I'm including results:
> 
> well SSE2 interleaving (I forgot to change algorithm_name to "SSE2")

As magnum correctly suggested, this should be automatic.  Also, it
should be reported as "AVX" when #ifdef __AVX__, because in that case
the compiler generates AVX instructions for the same SSE2 intrinsics.

Thanks,

Alexander
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.