john-dev - bcrypt on AVX2 (was: [john-users] Anyone want to benchmark AVX2 code for bcrypt)

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <20131101160938.GA8676@openwall.com>
Date: Fri, 1 Nov 2013 20:09:39 +0400
From: Solar Designer <solar@...nwall.com>
To: "Sc00bz64@...oo.com" <sc00bz64@...oo.com>, john-dev@...ts.openwall.com
Subject: bcrypt on AVX2 (was: [john-users] Anyone want to benchmark AVX2 code for bcrypt)

Steve, all -

I've just experimented with this some more.  Attached is my current hack
of the code.  Specifically, I tested for the hypothesis that the
slowness was primarily due to us slightly exceeding L1 data cache size -
and no, this does not appear to be the case.

First, here is a benchmark of Steve's original code (rebuilt with gcc
4.9.0 20130707), running on 4770K at stock clocks:

solar@...l:~/j/bcrypt/bcryptavx2$ ./bcryptbench64
Tests PASSED
Benchmarking...
AVX2: 8*128 took 1.0205 sec (1003.4 h/s) 0
Normal: 1024 took 1.6826 sec (608.6 h/s) 0

With a trivial Makefile edit:

#CFLAGS64=-c -Wall -m64 -O2
CFLAGS64=-c -Wall -m64 -O2 -fomit-frame-pointer -funroll-loops

solar@...l:~/j/bcrypt/bcryptavx2$ ./bcryptbench64
Tests PASSED
Benchmarking...
AVX2: 8*256 took 1.9977 sec (1025.2 h/s) 0
Normal: 1024 took 1.5484 sec (661.3 h/s) 0

With the Makefile edit above and running 8 instances of the program
simultaneously (with a shell script):

AVX2: 8*128 took 1.9539 sec (524.1 h/s) 0
AVX2: 8*128 took 1.9529 sec (524.4 h/s) 0
AVX2: 8*128 took 1.9527 sec (524.4 h/s) 0
AVX2: 8*128 took 1.9573 sec (523.2 h/s) 0
AVX2: 8*128 took 1.9569 sec (523.3 h/s) 0
AVX2: 8*128 took 1.9584 sec (522.9 h/s) 0
AVX2: 8*128 took 1.9603 sec (522.4 h/s) 0
AVX2: 8*128 took 1.9634 sec (521.6 h/s) 0
Normal: 1024 took 1.7368 sec (589.6 h/s) 0
Normal: 1024 took 1.7360 sec (589.9 h/s) 0
Normal: 1024 took 1.7364 sec (589.7 h/s) 0
Normal: 1024 took 1.7361 sec (589.8 h/s) 0
Normal: 1024 took 1.7386 sec (589.0 h/s) 0
Normal: 1024 took 1.7389 sec (588.9 h/s) 0
Normal: 1024 took 1.7386 sec (589.0 h/s) 0
Normal: 1024 took 1.7390 sec (588.8 h/s) 0

At this point, I hacked the code severely, changing the way the S-boxes
are laid out in memory.  Steve's original layout was such that the 8
instances used adjacent 32-bit words, including in S-boxes.  In my hack,
instance 0 uses offset 0, instance 1's S-boxes start at offset 4 KB, etc.
This lets me reduce the address range consumed by S-boxes of e.g. 7
bcrypt instances from 32 KB to 28 KB.  Without the layout change, even
if I used gather loads with masks I would have 32 KB with many 32-bit
gaps (so same number of L1 cache lines as with full 32 KB), not 28 KB.

Here's how this code performs when using the AVX2 vector width fully
(same as with Steve's original):

solar@...l:~/j/bcrypt/bcryptavx2-hack$ ./bcryptbench64 
Tests PASSED
Benchmarking...
AVX2: 8*256 took 1.9996 sec (1024.2 h/s) 0
Normal: 1024 took 1.5479 sec (661.5 h/s) 0

Almost same speed.  Good.

AVX2: 8*64 took 1.0041 sec (509.9 h/s) 0
AVX2: 8*64 took 1.0053 sec (509.3 h/s) 0
AVX2: 8*64 took 1.0065 sec (508.7 h/s) 0
AVX2: 8*64 took 1.0067 sec (508.6 h/s) 0
AVX2: 8*64 took 1.0068 sec (508.5 h/s) 0
AVX2: 8*64 took 1.0077 sec (508.1 h/s) 0
AVX2: 8*64 took 1.0091 sec (507.4 h/s) 0
AVX2: 8*64 took 1.0099 sec (507.0 h/s) 0
Normal: 1024 took 1.7359 sec (589.9 h/s) 0
Normal: 1024 took 1.7363 sec (589.8 h/s) 0
Normal: 1024 took 1.7360 sec (589.9 h/s) 0
Normal: 1024 took 1.7361 sec (589.8 h/s) 0
Normal: 1024 took 1.7393 sec (588.8 h/s) 0
Normal: 1024 took 1.7388 sec (588.9 h/s) 0
Normal: 1024 took 1.7391 sec (588.8 h/s) 0
Normal: 1024 took 1.7390 sec (588.8 h/s) 0

Very slightly slower.  OK.

Now let's try computing only 7 instances of bcrypt, to ensure there's
room left in L1 data cache for other uses (such as for P-boxes, etc.):

solar@...l:~/j/bcrypt/bcryptavx2-hack$ ./bcryptbench64 
Some tests failed, but this may be OK since 7 AVX2 code outputs are correct
AVX2 bcrypt failed to produce valid hashes.
Benchmarking...
AVX2: 7*256 took 1.9343 sec (926.5 h/s) 0
Normal: 1024 took 1.5486 sec (661.2 h/s) 0

The running time has reduced slightly (1.9996 sec to 1.9343 sec), but
not enough to result in any speedup in terms of c/s rate - in fact,
the c/s rate is significantly lower.

Let's try with 8 instances of the program:

Benchmarking...
AVX2: 7*128 took 1.8653 sec (480.4 h/s) 0
AVX2: 7*128 took 1.8665 sec (480.1 h/s) 0
AVX2: 7*128 took 1.8692 sec (479.4 h/s) 0
AVX2: 7*128 took 1.8700 sec (479.1 h/s) 0
AVX2: 7*128 took 1.8702 sec (479.1 h/s) 0
AVX2: 7*128 took 1.8733 sec (478.3 h/s) 0
AVX2: 7*128 took 1.8738 sec (478.2 h/s) 0
AVX2: 7*128 took 1.8740 sec (478.1 h/s) 0
Normal: 1024 took 1.7367 sec (589.6 h/s) 0
Normal: 1024 took 1.7364 sec (589.7 h/s) 0
Normal: 1024 took 1.7362 sec (589.8 h/s) 0
Normal: 1024 took 1.7360 sec (589.8 h/s) 0
Normal: 1024 took 1.7389 sec (588.9 h/s) 0
Normal: 1024 took 1.7388 sec (588.9 h/s) 0
Normal: 1024 took 1.7388 sec (588.9 h/s) 0
Normal: 1024 took 1.7387 sec (588.9 h/s) 0

No luck with this as well - we're still slower than the original and
slower than non-vectorized code.

I did try fewer than 7 instances per AVX2 vector as well - no luck
getting a performance improvement with e.g. 6 and 4 as well.  It is
possible, though, that a careful mix of 128-bit AVX2 (not just mask
applied to 256-bit AVX2, but 128-bit AVX2 instructions) with scalar
instructions would be slightly faster.  Such a mix can involve
interleaved instructions in the same thread or/and one vector and one
scalar thread scheduled to run on two halves of the same CPU core (would
need to set the affinity in this way).  This is yet to be tested.

For comparison, JtR achieves higher speeds on this same CPU than any of
the numbers seen above:

Will run 8 OpenMP threads
Benchmarking: bcrypt ("$2a$05", 32 iterations) [Blowfish 32/64 X2]... DONE
Raw:    6595 c/s real, 824 c/s virtual

This is because "Normal" in the benchmarks above corresponds to one
instance of bcrypt, whereas JtR's scalar code runs two instances with
interleaved instructions.

Alexander

On Wed, Jun 26, 2013 at 09:09:27AM -0700, Sc00bz64@...oo.com wrote:
> Windows 8 x64 4770k at 4.1GHz
> 
> 
> Single threaded performance:
> AVX2: 868.6 h/s
> Hashcat: 1170 h/s (hashcat-cli64.exe -m 3200 -a 3 -n 1 m3200.txt -1 ?l?u?d?s ?1?1?1?1?1?1?1)
> 
> So not using AVX2 is faster. One reason might be that it runs out of L1 cache (needs more than 32.5 KiB but there's only 32 KiB of L1) and has to hit L2.
> 
> 
> 
> 
> ----- Original Message -----
> From: "Sc00bz64@...oo.com" <sc00bz64@...oo.com>
> To: "john-users@...ts.openwall.com" <john-users@...ts.openwall.com>
> Cc: 
> Sent: Wednesday, June 26, 2013 6:47 AM
> Subject: [john-users] Anyone want to benchmark AVX2 code for bcrypt
> 
> Run this http://www.tobtu.com/files/bcryptavx2.zip (the source is included along with a 32 bit Windows and a 64 bit Linux binaries).
> 
> The non-AVX2 code is a little slow. I was too lazy to interlace it. Anyway 
> if you also have numbers from JtR (--format=bf) or Hashcat (-m 3200). 
> It's doing $2y$05$...

Download attachment "bcryptavx2-hack-1.tar.gz" of type "application/x-gzip" (23888 bytes)
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.