john-dev - Re: [GSoC] John the Ripper support for PHC finalists

Message-ID: <20150427012043.GA27103@openwall.com>
Date: Mon, 27 Apr 2015 04:20:43 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: [GSoC] John the Ripper support for PHC finalists

On Sun, Apr 26, 2015 at 09:37:10PM +0200, Agnieszka Bielec wrote:
> 2015-04-25 13:39 GMT+02:00 Solar Designer <solar@...nwall.com>:
> 
> > A major task that you haven't approached yet is instruction interleaving
> > on the CPUs.  Do you understand this concept?  Including why it helps?
> 
> I doubt that interleaving can help in POMELO.

It might, or it might not.  We should try.  Then re-test on future CPUs.

> Maybe I missed
> something. Interleaving can make SIMD possible when it's
> not possible within a single function execution.
> Actually, I think that this SIMD is good. Maybe you want to speed up
> the RAM access or maybe something else?

By interleaving, we mean primarily mixing of instructions from multiple
instances.  Not SIMD.  I understand what you mean by saying that
interleaving 2+ hash computations might enable use of SIMD, and we're
doing that too (e.g., we need 8 parallel MD5's to fill a 256-bit AVX2
vector), but that's not what we refer to when we say "interleaving".
We're also using interleaving on top of SIMD (so e.g. 16 or more
parallel MD5's per thread is likely optimal on AVX2, not just 8).
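
To illustrate what mixing instructions from multiple instances looks like
at the source level, here is a minimal scalar sketch.  The round function
and all names in it are hypothetical stand-ins (this is not MD5, POMELO,
or code from JtR or php_mt_seed), and whether the 2x variant is actually
faster depends on the CPU and the compiler, so it has to be measured.

```c
#include <assert.h>
#include <stdint.h>

/* Toy round function (hypothetical stand-in for a real hash round).
 * Each round consumes the previous round's result, so one instance
 * forms a serial dependency chain. */
static inline uint32_t mix_round(uint32_t x, uint32_t k)
{
    x += k;
    x ^= x >> 15;
    x *= 0x2c1b3c6dU;
    x ^= x >> 12;
    return x;
}

/* Non-interleaved: finish instance 0 before starting instance 1. */
static void hash2_serial(const uint32_t in[2], uint32_t out[2],
    unsigned int rounds)
{
    for (int i = 0; i < 2; i++) {
        uint32_t x = in[i];
        for (unsigned int r = 0; r < rounds; r++)
            x = mix_round(x, r);
        out[i] = x;
    }
}

/* Interleaved 2x: rounds of two independent instances alternate in one
 * instruction stream, so the CPU may overlap the latency of one chain
 * with work from the other. */
static void hash2_interleaved(const uint32_t in[2], uint32_t out[2],
    unsigned int rounds)
{
    uint32_t a = in[0], b = in[1];
    for (unsigned int r = 0; r < rounds; r++) {
        a = mix_round(a, r);
        b = mix_round(b, r);
    }
    out[0] = a;
    out[1] = b;
}

/* Self-check: both orderings must compute the same two results. */
static void check_interleaving(void)
{
    const uint32_t in[2] = { 0x12345678U, 0x9abcdef0U };
    uint32_t s[2], t[2];
    hash2_serial(in, s, 1000);
    hash2_interleaved(in, t, 1000);
    assert(s[0] == t[0] && s[1] == t[1]);
}
```

Both functions do the same work; only the order of the instructions
differs.  The same idea applies on top of SIMD by replacing each scalar
variable with a vector.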

Please do take a look at and play with different versions of
php_mt_seed.  It uses both SIMD and interleaving at once.  If you modify
it to only use SIMD, and not interleaving, it'd become much slower.  You
need to understand why.

What's your understanding as to why interleaving might help, beyond SIMD?

As to POMELO's SIMD being good, yes, it appears to be good for up to
256-bit.  For 512-bit, such as on MIC and AVX-512, we'd need to
experiment.  It might be best just to waste the upper 256 bits, or we
might use those too (run two instances in the wider SIMD vectors
side-by-side).  In fact, something like this happens on GPUs too, but
this detail is hidden from you by the OpenCL "driver's"
auto-vectorization.  I think POMELO's performance significantly depends
on the device's efficiency at gather loads, of 256-bit quantities in
this case, and on how those are implemented in code (e.g., using
native gather load instructions, although those typically support up to
64-bit vector elements only, so might be wasteful, or with explicit
loads/shifts of the 256-bit portions).
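
The two ways of fetching 256-bit quantities mentioned above can be
sketched in plain C, without intrinsics.  This assumes a hypothetical
layout in which two instances sit side by side in one 512-bit vector and
each has its own data-dependent index into its own memory; the types and
function names are made up for illustration and are not POMELO's code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Emulated 512-bit vector: lanes 0..3 belong to instance 0,
 * lanes 4..7 to instance 1 (assumed layout). */
typedef struct { uint64_t w[8]; } v512;

/* Explicit loads: one contiguous 256-bit (32-byte) copy per instance,
 * each from its own index, placed into its half of the vector. */
static v512 load2x256_explicit(const uint64_t *mem0, size_t i0,
    const uint64_t *mem1, size_t i1)
{
    v512 r;
    memcpy(&r.w[0], mem0 + 4 * i0, 4 * sizeof(uint64_t));
    memcpy(&r.w[4], mem1 + 4 * i1, 4 * sizeof(uint64_t));
    return r;
}

/* Element-wise gather: mimics a gather limited to 64-bit elements.
 * Eight separate element addresses are formed even though each group
 * of four is contiguous in memory, which is the potential waste. */
static v512 load2x256_gather64(const uint64_t *mem0, size_t i0,
    const uint64_t *mem1, size_t i1)
{
    v512 r;
    for (size_t lane = 0; lane < 4; lane++) {
        r.w[lane] = mem0[4 * i0 + lane];
        r.w[4 + lane] = mem1[4 * i1 + lane];
    }
    return r;
}

/* Self-check: both approaches must fetch the same data. */
static void check_loads(void)
{
    uint64_t m0[8], m1[8];
    for (int i = 0; i < 8; i++) {
        m0[i] = 10 + i;
        m1[i] = 20 + i;
    }
    v512 a = load2x256_explicit(m0, 1, m1, 0);
    v512 b = load2x256_gather64(m0, 1, m1, 0);
    assert(memcmp(a.w, b.w, sizeof(a.w)) == 0);
}
```

Which variant wins on a given device is something to benchmark rather
than assume.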

Once again, interleaving is a separate thing, on top of SIMD, although
it will need to be tuned along with SIMD (what interleaving is optimal
may vary depending on how we use SIMD, and vice versa).

Alexander