john-dev - Re: JtR on ARM (NEON)

Message-Id: <4FE77D30-F08A-4F44-948F-3FF1FA5ED8A5@gmail.com>
Date: Mon, 3 Aug 2015 16:53:39 +0800
From: Lei Zhang <zhanglei.april@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: JtR on ARM (NEON)

> On Jul 31, 2015, at 4:35 PM, Solar Designer <solar@...nwall.com> wrote:
> 
> On Fri, Jul 31, 2015 at 03:58:27PM +0800, Lei Zhang wrote:
>> A schoolmate of mine got an ARM board in his lab and gave me access to it. It's some model of Nvidia Tegra, with 4 cores and NEON support, though I don't know which specific model it is.
> 
> You should check /proc/cpuinfo under Linux.

Core-0:

processor	: 0
model name	: ARMv7 Processor rev 3 (v7l)
Features	: swp half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt
CPU implementer	: 0x41
CPU architecture: 7
CPU variant	: 0x3
CPU part	: 0xc0f
CPU revision	: 3

It looks like Tegra 3 or 4.
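For reference, the "CPU implementer" and "CPU part" fields can be decoded with a small lookup. The sketch below is only an illustration: arm_part_name() is a hypothetical helper, and the table lists just a few ARM part numbers. Part 0xc0f is Cortex-A15, which would point toward a Tegra 4-class chip rather than Tegra 3 (Cortex-A9), though I have not confirmed the exact model.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical helper (not part of JtR): map the "CPU implementer" and
 * "CPU part" values from /proc/cpuinfo to a core name.
 * Only a handful of ARM Ltd. (implementer 0x41) part numbers are listed.
 */
static const char *arm_part_name(unsigned implementer, unsigned part)
{
    if (implementer != 0x41)        /* 0x41 is 'A', i.e. ARM Ltd. */
        return "unknown";

    switch (part) {
    case 0xc07: return "Cortex-A7";
    case 0xc08: return "Cortex-A8";
    case 0xc09: return "Cortex-A9";
    case 0xc0f: return "Cortex-A15";
    default:    return "unknown";
    }
}

/* Checks the values reported by the board above. */
static void check_cpuinfo_decode(void)
{
    assert(strcmp(arm_part_name(0x41, 0xc0f), "Cortex-A15") == 0);
    assert(strcmp(arm_part_name(0x41, 0xc09), "Cortex-A9") == 0);
    assert(strcmp(arm_part_name(0x00, 0xc0f), "unknown") == 0);
}
```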

>> From the figures above, MD4 and MD5 get 2x speedup; SHA1 and SHA256 have no speedup; SHA512 gets a lot slower.
> 
> Yes.  That's weird.
> 
> I assume you haven't started playing with interleaving factors yet?

Not yet.

>> In my current implementation, most pseudo-intrinsics are directly mapped to NEON intrinsics. The only exceptions are vcmov and vroti, which have to be emulated.
> 
> As I told you before, no, vcmov must not be emulated - we have it on
> NEON natively.  Please see how it's done in DES_bs_b.c.
> 
> As to vroti, yes, although there's a 2-instruction way to emulate it,
> see page 4 in:
> 
> https://cryptojedi.org/papers/neoncrypto-20120320.pdf
> 
> Maybe it'd work faster at high interleaving factors (and slower at low
> interleaving factors, since it's higher latency than the straightforward
> 3-instruction approach).

I see your point. NEON's vbsl works just like vcmov.

With the 2-instruction emulation of vroti, PBKDF2-HMAC-SHA256 got a boost from 644 c/s real to 976 c/s real. Other formats saw no significant performance change.
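To illustrate why vbsl can stand in for vcmov, here is a scalar sketch of the per-bit semantics on a single 32-bit lane. The function names bsl32 and cmov_emulated are hypothetical, and the argument order of JtR's vcmov macro is an assumption on my part. On NEON the first function corresponds to vbslq_u32(mask, a, b).

```c
#include <assert.h>
#include <stdint.h>

/*
 * Scalar model of NEON bitwise select on one 32-bit lane:
 * take bits from a where mask is 1, and from b where mask is 0.
 * The NEON intrinsic with these semantics is vbslq_u32(mask, a, b).
 */
static uint32_t bsl32(uint32_t mask, uint32_t a, uint32_t b)
{
    return (a & mask) | (b & ~mask);
}

/*
 * Emulated select using only AND and XOR (two XORs and one AND),
 * as one would write it on a target without a native select.
 * The (a, b, mask) argument order here is an assumption.
 */
static uint32_t cmov_emulated(uint32_t a, uint32_t b, uint32_t mask)
{
    return b ^ (mask & (a ^ b));
}

/* Both forms must agree for any inputs; spot-check a few. */
static void check_select(void)
{
    const uint32_t v[] = { 0u, 0xFFFFFFFFu, 0x12345678u, 0x9ABCDEF0u, 0xFFFF0000u };
    const unsigned n = sizeof(v) / sizeof(v[0]);
    unsigned i, j, k;

    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            for (k = 0; k < n; k++)
                assert(bsl32(v[k], v[i], v[j]) == cmov_emulated(v[i], v[j], v[k]));
}
```

The native form is a single instruction, while the emulation needs three dependent operations.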

However, I have a problem with the emulation of vroti. Taken literally, it should be defined this way (positive i rotates left, negative i rotates right):

#define vroti_epi32(x, i) \
	((i) > 0 ? vsliq_n_u32(vshrq_n_u32((x), 32 - (i)), (x), (i)) : \
	           vsriq_n_u32(vshlq_n_u32((x), 32 + (i)), (x), -(i)))
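To check the logic of the macro independently of the compiler problem, here is a scalar sketch that models vsli and vsri on one 32-bit lane and compares the 2-operation rotate against the straightforward shift/shift/or form. All function names are hypothetical, and the models only cover shift counts 1..31, which is what a rotate needs.

```c
#include <assert.h>
#include <stdint.h>

/* Model of vsli (shift left and insert), 1 <= n <= 31:
 * result = (b << n) with the low n bits taken from a. */
static uint32_t sli32(uint32_t a, uint32_t b, unsigned n)
{
    uint32_t keep = 0xFFFFFFFFu >> (32 - n);    /* low n bits */
    return (a & keep) | (b << n);
}

/* Model of vsri (shift right and insert), 1 <= n <= 31:
 * result = (b >> n) with the top n bits taken from a. */
static uint32_t sri32(uint32_t a, uint32_t b, unsigned n)
{
    uint32_t keep = ~(0xFFFFFFFFu >> n);        /* top n bits */
    return (a & keep) | (b >> n);
}

/* Rotate left by i: one plain shift plus one shift-and-insert. */
static uint32_t rotl_2op(uint32_t x, unsigned i)
{
    return sli32(x >> (32 - i), x, i);
}

/* Rotate right by n, the negative-i branch of the macro. */
static uint32_t rotr_2op(uint32_t x, unsigned n)
{
    return sri32(x << (32 - n), x, n);
}

/* Straightforward form: shift left, shift right, OR. */
static uint32_t rotl_3op(uint32_t x, unsigned i)
{
    return (x << i) | (x >> (32 - i));
}

/* The three forms must agree for every rotate count 1..31. */
static void check_rotate(void)
{
    const uint32_t v[] = { 0x80000001u, 0x12345678u, 0xDEADBEEFu, 0u };
    unsigned k, i;

    for (k = 0; k < sizeof(v) / sizeof(v[0]); k++)
        for (i = 1; i < 32; i++) {
            assert(rotl_2op(v[k], i) == rotl_3op(v[k], i));
            assert(rotr_2op(v[k], i) == rotl_3op(v[k], 32 - i));
        }
}
```

The 2-operation form has a dependency between its two steps, which fits Solar's remark that it may only pay off at higher interleaving factors.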

Somehow it fails to compile, but only when building rawSHA1_ng_fmt_plug.o, with a cryptic error message:

	/tmp/ccgVmq2d.s: Assembler messages:
	/tmp/ccgVmq2d.s:4397: Error: co-processor offset out of range

Manually changing -O2 to -O0 for rawSHA1_ng_fmt_plug.o makes the error go away.

I googled it and found other people hitting the same error under various circumstances. I think it's quite possibly a compiler bug. Here's a related bug report: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=47246

The version of gcc used is 4.6.4. Unfortunately, there's currently no easy way to upgrade gcc on my schoolmate's board. Solar, do you have a newer gcc on your ARM board?

Lei
