Date: Fri, 31 Jul 2015 11:35:17 +0300
From: Solar Designer
Subject: Re: JtR on ARM (NEON)

On Fri, Jul 31, 2015 at 03:58:27PM +0800, Lei Zhang wrote:
> A schoolmate of mine got an ARM board in his lab and gave me access to it. It's some model of Nvidia Tegra, with 4 cores and NEON support, though I don't know which specific model it is.

You should check /proc/cpuinfo under Linux.

> (OpenMP is disabled in this test. PBKDF2-HMAC-SHA512 failed somehow, so I chose sha512crypt here.)

You'll need to investigate why PBKDF2-HMAC-SHA512 fails.  This might
provide a clue as to why sha512crypt became slower.

> Benchmarking: sha512crypt, crypt(3) $6$ (rounds=5000) [SHA512 64/32 OpenSSL]... DONE

BTW, the 64/32 here is wrong.  Should be 32/32.  Just because an
algorithm uses 64-bit integers logically doesn't mean we should report
it as using 64 out of 32 physical bits, since it can't.  magnum?

> From the figures above, MD4 and MD5 get a 2x speedup; SHA1 and SHA256 have no speedup; SHA512 gets a lot slower.

Yes.  That's weird.

I assume you haven't started playing with interleaving factors yet?

> In my current implementation, most pseudo-intrinsics are directly mapped to NEON intrinsics. The only exceptions are vcmov and vroti, which have to be emulated.

As I told you before, no, vcmov must not be emulated - we have it on
NEON natively.  Please see how it's done in DES_bs_b.c.
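To make the vcmov point concrete, here is a minimal sketch, not the code from DES_bs_b.c.  It assumes vcmov means a bitwise select, "take bits of a where mask is set, else bits of b"; the function names and the argument order are made up for illustration.  On NEON the select is a single VBSL, exposed as the vbslq_u32() intrinsic, whereas the emulation costs three logical operations.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
/* Native form: one VBSL per 128-bit vector.
 * vbslq_u32(mask, a, b) computes (a & mask) | (b & ~mask) in each lane. */
static inline uint32x4_t select_neon(uint32x4_t a, uint32x4_t b,
                                     uint32x4_t mask)
{
    return vbslq_u32(mask, a, b);
}
#endif

/* Scalar model of what the native bit-select computes in one lane. */
static inline uint32_t select_model(uint32_t a, uint32_t b, uint32_t mask)
{
    return (a & mask) | (b & ~mask);
}

/* Emulated form: XOR, AND, XOR (three operations instead of one). */
static inline uint32_t select_emulated(uint32_t a, uint32_t b, uint32_t mask)
{
    return b ^ (mask & (a ^ b));
}

/* Hypothetical self-check: both forms agree on a few bit patterns. */
void select_selfcheck(void)
{
    static const uint32_t v[] = {
        0x00000000u, 0xffffffffu, 0xaaaaaaaau, 0x0f0f0f0fu, 0xdeadbeefu
    };
    const unsigned n = sizeof(v) / sizeof(v[0]);
    for (unsigned i = 0; i < n; i++)
        for (unsigned j = 0; j < n; j++)
            for (unsigned k = 0; k < n; k++)
                assert(select_model(v[i], v[j], v[k]) ==
                       select_emulated(v[i], v[j], v[k]));
}
```

The scalar functions only model the semantics so that the equivalence can be checked on any machine; the NEON branch is compiled only where arm_neon.h is available.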

As to vroti, yes, although there's a 2-instruction way to emulate it,
see page 4 in:

Maybe it'd work faster at high interleaving factors (and slower at low
interleaving factors, since it's higher latency than the straightforward
3-instruction approach).
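As an illustration only: one two-instruction rotate that NEON permits is a shift left followed by VSRI (shift right and insert).  Whether this is the exact sequence the referenced page describes is an assumption.  The sketch below models VSRI on a single 32-bit lane in plain C so it can be checked anywhere; sri32(), rotl32_sri() and the macro name are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
/* Two instructions: VSHL, then VSRI into the shifted value.
 * S must be a compile-time constant in 1..31. */
#define ROTL32_NEON_SRI(x, S) \
    vsriq_n_u32(vshlq_n_u32((x), (S)), (x), 32 - (S))
#endif

/* Scalar model of VSRI on one lane, 1 <= n <= 31:
 * keep the top n bits of d, insert (x >> n) into the remaining low bits. */
static inline uint32_t sri32(uint32_t d, uint32_t x, unsigned n)
{
    uint32_t low = 0xffffffffu >> n;
    return (d & ~low) | (x >> n);
}

/* Rotate left by s (1 <= s <= 31): shift, then shift-right-insert.
 * The second step depends on the first, hence the longer latency chain. */
static inline uint32_t rotl32_sri(uint32_t x, unsigned s)
{
    return sri32(x << s, x, 32 - s);
}

/* Straightforward three-operation rotate: two independent shifts, one OR. */
static inline uint32_t rotl32_3op(uint32_t x, unsigned s)
{
    return (x << s) | (x >> (32 - s));
}

/* Hypothetical self-check: both rotates agree for every valid count. */
void rotl32_sri_selfcheck(void)
{
    for (unsigned s = 1; s < 32; s++) {
        assert(rotl32_sri(0x80000001u, s) == rotl32_3op(0x80000001u, s));
        assert(rotl32_sri(0xdeadbeefu, s) == rotl32_3op(0xdeadbeefu, s));
    }
}
```

Which variant wins would have to be measured per interleaving factor on the actual board.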

BTW, when you emulate a rotate with two shifts, you may sometimes see
better results when you combine them with a XOR rather than an OR,
because crypto code tends to use XORs nearby, so the compiler will be
able to re-order the XORs if it sees an opportunity to hide latencies
that way.  With an OR and a XOR, it won't be easy for the compiler to
see that the OR is equivalent to a XOR in this particular case.
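A small sketch of why the XOR form is safe, with hypothetical function names: the two shifted halves of a rotate never have a set bit in the same position, so OR and XOR give identical results.  Writing it with XOR lets the compiler treat the rotate as part of a surrounding XOR chain and reassociate freely.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate left by s bits (1 <= s <= 31), halves combined with OR. */
static inline uint32_t rotl32_or(uint32_t x, unsigned s)
{
    return (x << s) | (x >> (32 - s));
}

/* Same rotate, halves combined with XOR.  (x << s) has zeros in its low
 * s bits and (x >> (32 - s)) has zeros everywhere else, so no bit is set
 * in both and XOR equals OR. */
static inline uint32_t rotl32_xor(uint32_t x, unsigned s)
{
    return (x << s) ^ (x >> (32 - s));
}

/* Illustrative fragment: a rotate feeding a XOR.  With rotl32_xor the whole
 * expression is one XOR chain that the compiler may reorder. */
static inline uint32_t mix_xor(uint32_t a, uint32_t b, unsigned s)
{
    return rotl32_xor(a, s) ^ b;
}

/* Hypothetical self-check: OR and XOR forms agree for every valid count. */
void rotl32_selfcheck(void)
{
    for (unsigned s = 1; s < 32; s++) {
        assert(rotl32_or(0x80000001u, s) == rotl32_xor(0x80000001u, s));
        assert(rotl32_or(0xdeadbeefu, s) == rotl32_xor(0xdeadbeefu, s));
    }
}
```

Whether this actually helps depends on the compiler and the surrounding code, so it is worth benchmarking both forms.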

> But I don't think they're the excuses for the poor performance, since they're also emulated in an AVX build.

Yes, there must be something else as well.  Maybe unaligned accesses.

