john-dev - Re: JtR on ARM (NEON)

Message-ID: <20150731083517.GB31035@openwall.com>
Date: Fri, 31 Jul 2015 11:35:17 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: JtR on ARM (NEON)

On Fri, Jul 31, 2015 at 03:58:27PM +0800, Lei Zhang wrote:
> A schoolmate of mine got an ARM board in his lab and gave me access to it. It's some model of Nvidia Tegra, with 4 cores and NEON support, though I don't know which specific model it is.

You should check /proc/cpuinfo under Linux.

> (OpenMP is disabled in this test. PBKDF2-HMAC-SHA512 failed somehow, so I chose sha512crypt here.)

You'll need to investigate why PBKDF2-HMAC-SHA512 fails.  This might
provide a clue as to why sha512crypt became slower.

> Benchmarking: sha512crypt, crypt(3) $6$ (rounds=5000) [SHA512 64/32 OpenSSL]... DONE

BTW, the 64/32 here is wrong.  Should be 32/32.  Just because an
algorithm uses 64-bit integers logically doesn't mean we should report
it as using 64 out of 32 physical bits, since it can't.  magnum?

> From the figures above, MD4 and MD5 get 2x speedup; SHA1 and SHA256 have no speedup; SHA512 gets a lot slower.

Yes.  That's weird.

I assume you haven't started playing with interleaving factors yet?

> In my current implementation, most pseudo-intrinsics are directly mapped to NEON intrinsics. The only exceptions are vcmov and vroti, which have to be emulated.

As I told you before, no, vcmov must not be emulated - we have it on
NEON natively.  Please see how it's done in DES_bs_b.c.
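
For reference, here is a rough sketch of what I mean, not the actual
DES_bs_b.c code.  It assumes vcmov(y, z, x) has the usual bit-select
meaning (bits of y where x is set, bits of z elsewhere).  The names
cmov_neon, cmov_model, and cmov_emulated are made up for illustration.
The NEON branch uses the vbslq_u32() intrinsic; the scalar functions
only model the semantics and the XOR-AND-XOR emulation so that the
sketch also builds on non-ARM hosts.

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
/* Native select: one VBSL per 128-bit vector.
 * Takes bits of y where x is set, bits of z elsewhere. */
static inline uint32x4_t cmov_neon(uint32x4_t y, uint32x4_t z, uint32x4_t x)
{
    return vbslq_u32(x, y, z);
}
#endif

/* Scalar model of the assumed vcmov semantics (hypothetical helper). */
static inline uint32_t cmov_model(uint32_t y, uint32_t z, uint32_t x)
{
    return (x & y) | (~x & z);
}

/* The generic three-operation emulation: XOR, AND, XOR. */
static inline uint32_t cmov_emulated(uint32_t y, uint32_t z, uint32_t x)
{
    return z ^ (x & (y ^ z));
}
```

The emulated form costs three dependent operations per select, where
the native one is a single instruction.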

As to vroti, yes, although there's a 2-instruction way to emulate it,
see page 4 in:

https://cryptojedi.org/papers/neoncrypto-20120320.pdf

Maybe it'd work faster at high interleaving factors (and slower at low
interleaving factors, since it's higher latency than the straightforward
3-instruction approach).
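
A sketch of the two variants for 32-bit elements, assuming the
2-instruction form is a left shift followed by a shift-right-and-insert
(VSHL + VSRI).  The macro and function names here are hypothetical, and
the rotate count must be a constant in 1..31 for the NEON macros.  The
scalar functions only model what each instruction sequence computes, so
the sketch can be checked on any host.

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
/* 2 instructions: the insert depends on the result of the shift. */
#define ROTL32_NEON_2(x, r) \
    vsriq_n_u32(vshlq_n_u32((x), (r)), (x), 32 - (r))
/* 3 instructions: the two shifts are independent of each other. */
#define ROTL32_NEON_3(x, r) \
    vorrq_u32(vshlq_n_u32((x), (r)), vshrq_n_u32((x), 32 - (r)))
#endif

/* Scalar model of shift-right-and-insert for n in 1..31:
 * keep the top n bits of dst, fill the rest with src >> n. */
static inline uint32_t sri_model(uint32_t dst, uint32_t src, unsigned n)
{
    uint32_t low = 0xFFFFFFFFu >> n;
    return (dst & ~low) | (src >> n);
}

/* Rotate left by r (1..31) as shift + insert. */
static inline uint32_t rotl32_two_step(uint32_t x, unsigned r)
{
    return sri_model(x << r, x, 32 - r);
}

/* Rotate left by r (1..31) as two shifts combined with OR. */
static inline uint32_t rotl32_three_step(uint32_t x, unsigned r)
{
    return (x << r) | (x >> (32 - r));
}
```

Which one wins would have to be measured at each interleaving factor.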

BTW, when you emulate a rotate with two shifts, you may sometimes see
better results when you combine them with a XOR rather than an OR,
because crypto code tends to use XORs nearby, so the compiler will be
able to re-order the XORs if it sees an opportunity to hide latencies
that way.  With an OR and a XOR, it won't be easy for the compiler to
see that the OR is equivalent to a XOR in this particular case.
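
To illustrate: the two shifted halves of a rotate never have a set bit
in the same position, so OR and XOR give the same result.  A minimal
scalar sketch, with made-up function names and a made-up mixing step
(mix_xor) that is not taken from any real hash:

```c
#include <stdint.h>

/* Rotate left by r (1..31), halves combined with OR. */
static inline uint32_t rotl_or(uint32_t x, unsigned r)
{
    return (x << r) | (x >> (32 - r));
}

/* Same rotate, halves combined with XOR.  The halves occupy
 * disjoint bit positions, so the result is identical. */
static inline uint32_t rotl_xor(uint32_t x, unsigned r)
{
    return (x << r) ^ (x >> (32 - r));
}

/* Hypothetical mixing step: rotl(a, 7) ^ b ^ c written as one XOR
 * chain, which the compiler is free to evaluate in any order. */
static inline uint32_t mix_xor(uint32_t a, uint32_t b, uint32_t c)
{
    return (a << 7) ^ (a >> 25) ^ b ^ c;
}

/* Returns 1 if both rotate forms agree for a few inputs and all
 * rotate counts 1..31, 0 otherwise. */
int rotate_forms_agree(void)
{
    static const uint32_t samples[] = {
        0u, 1u, 0x80000000u, 0xFFFFFFFFu, 0xDEADBEEFu, 0x12345678u
    };
    unsigned i, r;

    for (i = 0; i < sizeof(samples) / sizeof(samples[0]); i++)
        for (r = 1; r < 32; r++)
            if (rotl_or(samples[i], r) != rotl_xor(samples[i], r))
                return 0;
    return 1;
}
```

With the OR form, the same step would be (rotate with OR) ^ b ^ c, and
the compiler would have to prove the disjointness itself before it could
treat the whole thing as one XOR chain.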

> But I don't think they're the reason for the poor performance, since they're also emulated in an AVX build.

Yes, there must be something else as well.  Maybe unaligned accesses.

Alexander
