john-dev - Re: bcrypt on AVX2 (was: [john-users] Anyone want to benchmark AVX2 code for bcrypt)

Follow @Openwall on Twitter for new release announcements and other news

[<prev] [next>] [<thread-prev] [day] [month] [year] [list]

Message-ID: <20131101214255.GA11091@openwall.com>
Date: Sat, 2 Nov 2013 01:42:55 +0400
From: Solar Designer <solar@...nwall.com>
To: "Sc00bz64@...oo.com" <sc00bz64@...oo.com>, john-dev@...ts.openwall.com
Subject: Re: bcrypt on AVX2 (was: [john-users] Anyone want to benchmark AVX2 code for bcrypt)

On Fri, Nov 01, 2013 at 08:09:39PM +0400, Solar Designer wrote:
> Now let's try computing only 7 instances of bcrypt, to ensure there's
> room left in L1 data cache for other uses (such as for P-boxes, etc.):
> 
> solar@...l:~/j/bcrypt/bcryptavx2-hack$ ./bcryptbench64 
> Some tests failed, but this may be OK since 7 AVX2 code outputs are correct
> AVX2 bcrypt failed to produce valid hashes.
> Benchmarking...
> AVX2: 7*256 took 1.9343 sec (926.5 h/s) 0
> Normal: 1024 took 1.5486 sec (661.2 h/s) 0

I've also tried with icc 14.0.0, after adding "volatile" before "union"
in three places (to workaround what I think is an icc bug, which I also
faced with another program before).  Got similar speeds:

Some tests failed, but this may be OK since 7 AVX2 code outputs are correct
AVX2 bcrypt failed to produce valid hashes.
Benchmarking...
AVX2: 7*256 took 1.9314 sec (927.8 h/s) 0
Normal: 1024 took 1.9710 sec (519.5 h/s) 0

Well, somehow "Normal" was hit by the move from gcc to icc, but this
does not matter much - we know it can be much faster anyway with 2x
interleaving.  The AVX2 speed is still worse than interleaved scalar
code's speed.

I also took a look at the generated assembly code from both gcc and icc.
Both look sane, although there are lots of vmovdqa's due to the way
vpgatherdd is defined.  Maybe half of those vmovdqa's could be avoided
by programming in assembly rather than with intrinsics, since they are
there only to substitute the default value for the masked lanes.  We
don't care what this default value is (any garbage would do), but
_mm256_mask_i32gather_epi32() forces us to specify it.

I don't think avoiding those vmovdqa's would make enough of a
difference.  Possibly it'd make no difference at all: I think MOVs are
generally free starting with Ivy Bridge.

On the other hand, I got almost 2x speedup when going from AVX to AVX2
in php_mt_seed, as tested on this same 4770K CPU.  php_mt_seed's AVX
code speed on this CPU was worse (per MHz) than it used to be on older
CPUs, though, so going to AVX2 was in part a workaround.

Alexander

Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.