Date: Mon, 3 Aug 2015 16:53:39 +0800
From: Lei Zhang <>
Subject: Re: JtR on ARM (NEON)

> On Jul 31, 2015, at 4:35 PM, Solar Designer <> wrote:
> On Fri, Jul 31, 2015 at 03:58:27PM +0800, Lei Zhang wrote:
>> A schoolmate of mine got an ARM board in his lab and gave me access to it. It's some model of Nvidia Tegra, with 4 cores and NEON support, though I don't know which specific model it is.
> You should check /proc/cpuinfo under Linux.


processor	: 0
model name	: ARMv7 Processor rev 3 (v7l)
Features	: swp half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt
CPU implementer	: 0x41
CPU architecture: 7
CPU variant	: 0x3
CPU part	: 0xc0f
CPU revision	: 3

It looks like Tegra 3 or 4.

>> From the figures above, MD4 and MD5 get 2x speedup; SHA1 and SHA256 have no speedup; SHA512 gets a lot slower.
> Yes.  That's weird.
> I assume you haven't started playing with interleaving factors yet?

Not yet.

>> In my current implementation, most pseudo-intrinsics are directly mapped to NEON intrinsics. The only exceptions are vcmov and vroti, which have to be emulated.
> As I told you before, no, vcmov must not be emulated - we have it on
> NEON natively.  Please see how it's done in DES_bs_b.c.
> As to vroti, yes, although there's a 2-instruction way to emulate it,
> see page 4 in:
> Maybe it'd work faster at high interleaving factors (and slower at low
> interleaving factors, since it's higher latency than the straightforward
> 3-instruction approach).

I get your point. NEON's vbsl works just like vcmov.
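To make the equivalence concrete, here is a minimal scalar sketch in plain C (not NEON code). It assumes vcmov(y, z, x) means "take bits of y where x is set, else bits of z", and that the generic emulation is the usual AND/XOR form; both are my assumptions here, not copied from the tree. The names bsl_model, vcmov_emulated and vcmov_selfcheck are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of a bitwise select (what NEON's VBSL does per lane):
 * take bits from a where mask is 1, and from b where mask is 0. */
uint32_t bsl_model(uint32_t mask, uint32_t a, uint32_t b)
{
	return (a & mask) | (b & ~mask);
}

/* Assumed generic emulation of vcmov(y, z, x) using only AND and XOR. */
uint32_t vcmov_emulated(uint32_t y, uint32_t z, uint32_t x)
{
	return z ^ (x & (y ^ z));
}

/* Hypothetical self-check: both forms agree on a sample input. */
void vcmov_selfcheck(void)
{
	uint32_t y = 0xdeadbeefu, z = 0x01234567u, x = 0xffff0000u;

	assert(bsl_model(x, y, z) == vcmov_emulated(y, z, x));
	assert(bsl_model(x, y, z) == 0xdead4567u);
}
```

Under those assumptions, a native mapping would put the selector first, along the lines of vbslq_u32(x, y, z), which replaces the AND/XOR sequence with one instruction.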

With the 2-instruction emulation of vroti, PBKDF2-HMAC-SHA256 got a boost from 644 c/s real to 976 c/s real. Other formats saw no significant performance change.
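For reference, this is how I understand the two approaches, written as a scalar model in plain C rather than NEON code. sli_model and sri_model are hypothetical helpers that mimic one 32-bit lane of VSLI and VSRI for shift counts 1..31; the rot* names are made up too. It's only a sketch to check the bit logic, not the code in my tree.

```c
#include <assert.h>
#include <stdint.h>

/* Model of VSLI (1 <= n <= 31): shift b left by n and insert it into a,
 * keeping only the low n bits of a. */
uint32_t sli_model(uint32_t a, uint32_t b, unsigned n)
{
	uint32_t keep = ((uint32_t)1 << n) - 1;	/* low n bits */
	return (a & keep) | (b << n);
}

/* Model of VSRI (1 <= n <= 31): shift b right by n and insert it into a,
 * keeping only the top n bits of a. */
uint32_t sri_model(uint32_t a, uint32_t b, unsigned n)
{
	uint32_t keep = ~(0xffffffffu >> n);	/* top n bits */
	return (a & keep) | (b >> n);
}

/* 3-instruction rotate left: shift left, shift right, OR. */
uint32_t rotl_3insn(uint32_t x, unsigned i)
{
	return (x << i) | (x >> (32 - i));
}

/* 2-instruction rotate left: shift right, then shift-left-and-insert. */
uint32_t rotl_2insn(uint32_t x, unsigned i)
{
	return sli_model(x >> (32 - i), x, i);
}

/* 2-instruction rotate right: shift left, then shift-right-and-insert. */
uint32_t rotr_2insn(uint32_t x, unsigned i)
{
	return sri_model(x << (32 - i), x, i);
}

/* Hypothetical self-check over all shift counts 1..31. */
void vroti_selfcheck(void)
{
	uint32_t x = 0x12345678u;
	unsigned i;

	for (i = 1; i < 32; i++) {
		assert(rotl_2insn(x, i) == rotl_3insn(x, i));
		assert(rotr_2insn(rotl_2insn(x, i), i) == x);
	}
}
```

The second instruction depends on the result of the first, which fits the point above that the 2-instruction form has higher latency and is more likely to pay off at higher interleaving factors.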

However, I have a problem with the emulation of vroti. Taken literally, it should be defined this way:

#define vroti_epi32(x, i) \
	((i) > 0 ? vsliq_n_u32(vshrq_n_u32((x), 32 - (i)), (x), (i)) : \
	           vsriq_n_u32(vshlq_n_u32((x), 32 + (i)), (x), -(i)))

Somehow it fails to compile, but only when building rawSHA1_ng_fmt_plug.o, with a cryptic error message:

	/tmp/ccgVmq2d.s: Assembler messages:
	/tmp/ccgVmq2d.s:4397: Error: co-processor offset out of range

Manually changing -O2 to -O0 for rawSHA1_ng_fmt_plug.o makes the error go away.

I googled it and found other people hitting the same issue under various circumstances. I think it's quite possibly a compiler bug. Here's a related bug report:

The version of gcc used is 4.6.4. Unfortunately, there's currently no easy way to upgrade gcc on my schoolmate's board. Solar, do you have a newer gcc on your ARM board?

