Date: Sat, 9 May 2015 22:25:06 +0200
From: Agnieszka Bielec <bielecagnieszka8@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: [GSoC] John the Ripper support for PHC finalists

2015-05-05 19:05 GMT+02:00 Solar Designer <solar@...nwall.com>:
> Agnieszka,
> I notice that you use 2x interleaving factor with SSE2, but 4x with
> AVX2.  Why this specific choice?

No particular reason.

> I suggest that you try 2x interleaving with AVX2.  There's little point
> in going for 4x without having seen speedup with 2x first.

With 2x interleaving for AVX2, the decrease in speed at cost=6:6,6:6 is
smaller, and there is no difference between 2x and 4x at costs 2 and 4.
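
For reference, 2x interleaving just means keeping two independent
candidates in flight per loop iteration so their dependency chains can
overlap in the pipeline. Roughly this shape (a generic sketch, nothing
like the real POMELO round):

#include <immintrin.h>

/* generic 2x shape: two independent states updated per iteration,
   so the CPU can overlap their dependency chains (sketch only) */
static void rounds_2x(__m256i *s0, __m256i *s1, __m256i k, int n)
{
    int i;
    for (i = 0; i < n; i++) {
        *s0 = _mm256_add_epi64(*s0, k);
        *s1 = _mm256_add_epi64(*s1, k);
        *s0 = _mm256_xor_si256(*s0, _mm256_slli_epi64(*s0, 17));
        *s1 = _mm256_xor_si256(*s1, _mm256_slli_epi64(*s1, 17));
    }
}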

> Then, you're unnecessarily relying on compiler optimizations too much.
> You've turned random_number, index_local, and index_global from
> variables into arrays, and you're hoping(?) that the compiler will
> allocate groups of 2 or 4 registers for them anyway.  Well, it might or
> it might not.  Please use simple variables, like the original code did.
> So random_number0, random_number1, etc.  You're using explicit indices
> anyway, so this won't complicate your code.

changed
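
For reference, the change is just this (sketch; the types and the real
initializers come from the POMELO state, these are placeholders):

#include <stdint.h>

/* before: small arrays, hoping the compiler keeps them in registers */
uint64_t random_number[2], index_local[2], index_global[2];

/* after: plain variables, which the compiler treats as register
   candidates without having to prove the array never escapes */
uint64_t random_number0, random_number1;
uint64_t index_local0, index_local1;
uint64_t index_global0, index_global1;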

> Please omit or rework index_global_t.  There's no point in precomputing
> values like "index_global_t[3]=index_global[3]+3;" when you're then only
> using them like "S[index_global_t[3]]".  If S[] elements were no larger
> than 8 bytes each, then the CPU's addressing modes would enable e.g.
> S+i*8+24 to be calculated during effective address calculation in the
> load instruction at no extra cost.  This doesn't work for elements
> larger than 8 (and yours are __m256i), so it makes sense to precompute
> them multiplied by sizeof(__m256i), and then access the data via a macro
> that would do the proper typecasts to use byte offsets.  Not only for
> index_global_t, but also for i0 and index_local*, so that the
> multiplication by sizeof(__m256i) (shift left by 5) would be performed
> less frequently, and then +32, +64, +96, etc. would be added to it.

I omitted index_global_t
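
For reference, the byte-offset access Solar describes looks roughly
like this (sketch; the macro name and surrounding code are mine, not
the actual patch):

#include <immintrin.h>
#include <stddef.h>

/* access an __m256i array by precomputed byte offset, so the <<5 is
   done once and nearby elements are reached via +32, +64, +96 */
#define S_BYTE(p, off) (*(__m256i *)((char *)(p) + (off)))

static void sketch(__m256i *S, size_t index_global)
{
    size_t ig = index_global << 5;   /* * sizeof(__m256i), computed once */
    __m256i t;

    t = S_BYTE(S, ig);                            /* S[index_global]     */
    t = _mm256_add_epi64(t, S_BYTE(S, ig + 32));  /* S[index_global + 1] */
    t = _mm256_add_epi64(t, S_BYTE(S, ig + 96));  /* S[index_global + 3] */
    S_BYTE(S, ig) = t;
}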

> On Sat, May 02, 2015 at 06:14:05AM +0200, Agnieszka Bielec wrote:
>> I made interleaving for no-SIMD, SSE2 and AVX2 version, the speed for
>> costs 2,2 and 0,0 is slightly better but for costs 6,6 and 8,8 is
>> worse, so I'm not sure if I did everything correctly.
>
> Given POMELO's use of memory, interleaving might in fact be of little
> help, as the more memory you use at once, the slower the memory accesses
> become as you're getting further out of cache.  I think this is why
> you're not seeing a speedup with only your initial implementation, not
> optimized yet.  You might or might not see more of a speedup when you
> implement optimizations such as what I suggested above.


> I suggest that you review the generated assembly code without and with
> interleaving.  See if extra instructions get generated (such as spilling
> registers to memory and loading them back).

The interleaved function (SSE2) contains more lea instructions:

none@...e ~/Desktop $ cat sse.asm | grep lea | wc -l
189
none@...e ~/Desktop $ cat sseold.asm | grep lea | wc -l
59

none@...e ~/Desktop $ cat sse.asm | grep rbp | wc -l
141
none@...e ~/Desktop $ cat sseold.asm | grep rbp | wc -l
27

none@...e ~/Desktop $ cat sseold.asm | grep movdqu | wc -l
126
none@...e ~/Desktop $ cat sse.asm | grep movdqu | wc -l
264

It looks like the additional set of instructions isn't interleaved with
the original one.

> Also, find those left shifts that are used to calculate byte offsets
> from indices.  See if any can be avoided or moved to outer loops.
> Perhaps some of these optimizations can also be made to non-interleaved
> code (and even submitted back to the author of POMELO).

In my opinion, they can't be moved to a better place.
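
To be clear on what I checked: hoisting would only help if the index
were loop-invariant, roughly like this (generic sketch); as far as I
can tell, POMELO recomputes its indices from the state on every
iteration, so there's no outer loop to move the shift to.

#include <stddef.h>

/* hoisting sketch: only valid when idx doesn't change inside the loop */
static void sketch(unsigned char *S, size_t idx, size_t n)
{
    size_t off = idx << 5;    /* shift moved out of the loop, done once */
    size_t i;

    for (i = 0; i < n; i++)
        S[off + i] ^= 0x5a;   /* inner loop uses plain additions only */
}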

>
>> Maybe it's because we have bigger gaps between chunks of data in memory
>
> No, I think the memory layout is fine.  When different cache lines are
> accessed, it does not matter how large or small the gap between their
> currently cached addresses is.

But what if the same cache lines are accessed when we use less memory,
and different ones when we use more? I mean the H function jumping to
random locations, and the L2 cache.

>
> However, I suggest that you align the memory allocations to be on cache
> line boundary.  Right now, you align them to 32 bytes as AVX2 requires,
> but our cache lines are 64 bytes.  Crossing a cache line boundary
> unnecessarily has performance cost and it thrashes other valuable data
> out of cache (it thrashes two cache lines instead of just one).
>
>
> Oh, and in the SSE2/AVX code you're not aligning the memory allocation
> of S at all, so you only get the current malloc()'s guaranteed 16-byte
> alignment.  This might or might not happen to also be 64-byte aligned.
> You should explicitly make it at least 64-byte aligned.

This is also done now.
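
For reference, the allocation now looks roughly like this (sketch using
posix_memalign; the actual code may go through JtR's own allocation
helpers, and state_size is a placeholder):

#include <stdlib.h>

/* put S on a 64-byte (cache line) boundary instead of relying on
   malloc()'s 16-byte guarantee */
static void *alloc_state(size_t state_size)
{
    void *S = NULL;

    if (posix_memalign(&S, 64, state_size))
        return NULL;          /* allocation failed */
    return S;                 /* release with free() as usual */
}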

> As magnum correctly suggested, this should be automatic.  Also, it
> should be reported as "AVX" when #ifdef __AVX__, because in that case
> the compiler generates AVX instructions for the same SSE2 intrinsics.

It's automatic. I tested AVX2 and SSE2 on well, where AVX2 is
supported, so I changed only one function call for the test. I could
also have not made the export for gcc, but changing a function call was
faster for me.
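
The reporting can be done with something like this (sketch; the macro
name is mine, the real format uses its own algorithm-name string):

/* the same SSE2 intrinsics compile to AVX instructions when -mavx
   is in effect, so report accordingly */
#if defined(__AVX2__)
#define POMELO_SIMD_NAME "AVX2"
#elif defined(__AVX__)
#define POMELO_SIMD_NAME "AVX"
#elif defined(__SSE2__)
#define POMELO_SIMD_NAME "SSE2"
#else
#define POMELO_SIMD_NAME "no SIMD"
#endif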

___

I noticed that SSE2 is slightly faster after the modifications:

SSE2, interleaving before modifications:

a@...l:~/hmm/run$ ./john --format=pomelo --test --cost=2:2,2:2
Will run 8 OpenMP threads
Benchmarking: POMELO, Generic pomelo [SSE2]... (8xOMP) DONE
Speed for cost 1 (N) of 2, cost 2 (r) of 2
Many salts:    104192 c/s real, 13024 c/s virtual
Only one salt:    104448 c/s real, 13056 c/s virtual

a@...l:~/hmm/run$ ./john --format=pomelo --test --cost=4:4,4:4
Will run 8 OpenMP threads
Benchmarking: POMELO, Generic pomelo [SSE2]... (8xOMP) DONE
Speed for cost 1 (N) of 4, cost 2 (r) of 4
Many salts:    6525 c/s real, 816 c/s virtual
Only one salt:    6525 c/s real, 817 c/s virtual


a@...l:~/hmm/run$ ./john --format=pomelo --test --cost=6:6,6:6
Will run 8 OpenMP threads
Benchmarking: POMELO, Generic pomelo [SSE2]... (8xOMP) DONE
Speed for cost 1 (N) of 6, cost 2 (r) of 6
Many salts:    322 c/s real, 43.4 c/s virtual
Only one salt:    320 c/s real, 42.7 c/s virtual



SSE2, interleaving after modifications:

a@...l:~/hmm/run$ ./john --format=pomelo --test --cost=2:2,2:2
Will run 8 OpenMP threads
Benchmarking: POMELO, Generic pomelo [SSE2]... (8xOMP) DONE
Speed for cost 1 (N) of 2, cost 2 (r) of 2
Many salts:    109056 c/s real, 13649 c/s virtual
Only one salt:    109056 c/s real, 13632 c/s virtual


a@...l:~/hmm/run$ ./john --format=pomelo --test --cost=4:4,4:4
Will run 8 OpenMP threads
Benchmarking: POMELO, Generic pomelo [SSE2]... (8xOMP) DONE
Speed for cost 1 (N) of 4, cost 2 (r) of 4
Many salts:    6776 c/s real, 844 c/s virtual
Only one salt:    6710 c/s real, 843 c/s virtual

a@...l:~/hmm/run$ ./john --format=pomelo --test --cost=6:6,6:6
Will run 8 OpenMP threads
Benchmarking: POMELO, Generic pomelo [SSE2]... (8xOMP) DONE
Speed for cost 1 (N) of 6, cost 2 (r) of 6
Many salts:    341 c/s real, 45.2 c/s virtual
Only one salt:    341 c/s real, 45.3 c/s virtual

The reduction in speed at bigger costs is the same, except for AVX2:
after I changed the interleaving to 2x, the reduction is now similar to
the SSE2 and no-SIMD versions.

thanks
