john-dev - Re: PHC: Argon2 on CPU

Message-ID: <20150806140204.GD18936@openwall.com>
Date: Thu, 6 Aug 2015 17:02:04 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: PHC: Argon2 on CPU

Agnieszka,

On Sun, Aug 02, 2015 at 10:46:00PM +0200, Agnieszka Bielec wrote:
> hi, I have argon2i/d on CPU and GPU although I have not optimizations on GPU yet
> argon2i/d is in both versions: REF and OPT-SSE
> turned out that OPT-SSE after I removed SSE is faster than REF and my

In other words, you produced a SIMD-less version based on the SIMD
version's source code?  If so, you should keep these two faster versions
(one for use on SIMD-capable CPUs, the other on SIMD-less CPUs and on
archs for which we don't yet have SIMD intrinsics) in a JtR format (the
CPU one), choosing the faster one based on whether a suitable SIMD
instruction set is enabled for a given John build or not.  In fact, this
is what we should do for all other formats as well (and what we already
do for many in core and jumbo trees, starting e.g. with descrypt, which
was the very first format JtR ever supported).
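
To illustrate the kind of build-time choice I mean, here is a minimal
sketch, not actual JtR code.  The names demo_xor_words and
DEMO_ALGORITHM_NAME are made up for this example, and the XOR loop is
just a stand-in for the real per-block work:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>
#define DEMO_ALGORITHM_NAME "SSE2"   /* hypothetical name, for illustration */

/* XOR n 64-bit words of src into dst, two words per 128-bit vector. */
static void demo_xor_words(uint64_t *dst, const uint64_t *src, size_t n)
{
    size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        __m128i a = _mm_loadu_si128((const __m128i *)(dst + i));
        __m128i b = _mm_loadu_si128((const __m128i *)(src + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_xor_si128(a, b));
    }
    for (; i < n; i++)               /* odd tail word, if any */
        dst[i] ^= src[i];
}
#else
#define DEMO_ALGORITHM_NAME "scalar" /* hypothetical name, for illustration */

/* Same operation without SIMD, for builds where SSE2 is not enabled. */
static void demo_xor_words(uint64_t *dst, const uint64_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] ^= src[i];
}
#endif

/* Reports which variant this build selected. */
static const char *demo_algorithm_name(void)
{
    return DEMO_ALGORITHM_NAME;
}

/* Both variants must produce identical results. */
static void demo_xor_self_check(void)
{
    uint64_t a[3] = { 1, 2, 3 };
    const uint64_t b[3] = { 3, 2, 1 };
    demo_xor_words(a, b, 3);
    assert(a[0] == 2 && a[1] == 0 && a[2] == 2);
}
```

The point is that the selection happens via the compiler's predefined
macros, so the reported name always matches the code that was compiled.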

> GPU version bases on this

You're basing your Argon2 OpenCL code on SIMD-less CPU code?  Why is
that?  Wouldn't a vectorized OpenCL kernel likely run faster?

> results:
> 
> OPT-SSE
> none@...e ~/Desktop/rr/run $ ./john --test --format=argon2i
> Will run 8 OpenMP threads
> Benchmarking: argon2i [AVX]... (8xOMP)
> memory per hash : 100.00 kB
> using different password for benchmarking
> DONE
> Speed for cost 1 (t) of 3, cost 2 (m) of 100
> Raw:    31232 c/s real, 3908 c/s virtual
> 
> none@...e ~/Desktop/rr/run $ ./john --test --format=argon2d
> Will run 8 OpenMP threads
> Benchmarking: argon2d [AVX]... (8xOMP)
> memory per hash : 100.00 kB
> using different password for benchmarking
> DONE
> Speed for cost 1 (t) of 3, cost 2 (m) of 100
> Raw:    35328 c/s real, 4483 c/s virtual

I understand that you set these to the same parameters for a
straightforward comparison, but FWIW the minimum recommended t is in
fact 3 for Argon2i, whereas for Argon2d it is 1.  (This is for TMTO
resilience reasons.)  Maybe we should be benchmarking them at t=3 and
t=1, respectively, going forward.

> REF
> none@...e ~/Desktop/rr/run $ ./john --test --format=argon2i
> Will run 8 OpenMP threads
> Benchmarking: argon2i [AVX]... (8xOMP)
> memory per hash : 100.00 kB
> using different password for benchmarking
> DONE
> Speed for cost 1 (t) of 3, cost 2 (m) of 100
> Raw:    9216 c/s real, 1160 c/s virtual
> 
> none@...e ~/Desktop/rr/run $ ./john --test --format=argon2d
> Will run 8 OpenMP threads
> Benchmarking: argon2d [AVX]... (8xOMP)
> memory per hash : 100.00 kB
> using different password for benchmarking
> DONE
> Speed for cost 1 (t) of 3, cost 2 (m) of 100
> Raw:    10624 c/s real, 1336 c/s virtual
> 
> OPT
> 
> none@...e ~/Desktop/rr/run $ ./john --test --format=argon2i
> Will run 8 OpenMP threads
> Benchmarking: argon2i [AVX]... (8xOMP)
> memory per hash : 100.00 kB
> using different password for benchmarking
> DONE
> Speed for cost 1 (t) of 3, cost 2 (m) of 100
> Raw:    24064 c/s real, 3019 c/s virtual
> 
> none@...e ~/Desktop/rr/run $ ./john --test --format=argon2d
> Will run 8 OpenMP threads
> Benchmarking: argon2d [AVX]... (8xOMP)
> memory per hash : 100.00 kB
> using different password for benchmarking
> DONE
> Speed for cost 1 (t) of 3, cost 2 (m) of 100
> Raw:    27008 c/s real, 3418 c/s virtual

Nice speeds for presumably SIMD-less code, but please note that both of
your benchmarks above (REF and OPT) say AVX.  Are they lying?

> but I was testing these no-sse versions by modyfiyng my code, don't
> know if I can just turn-off simd (?), so I can't be sure of these
> results although I know that structure of REF is different than
> OPT-SSE one(maybe more) function was called a different number of time

I'm sorry, but I find your wording above confusing.  So let me try to
ask a clarifying question:

Are you reviewing the generated assembly code?  It's trivial to see if
the code is using SIMD or not.

And while we're at it:

How are you obtaining the assembly code for review?  Do you replace
gcc's "-c" option with "-S"?  Or do you use "objdump -d" on the .o file?

> GPU
> 
> none@...e ~/Desktop/rr/run $ ./john --test --format=argon2i-opencl
> Benchmarking: argon2i-opencl [Blake2 OpenCL]...
> memory per hash : 100.00 kB
> Device 0: GeForce GTX 960M
> using different password for benchmarking
> DONE
> Speed for cost 1 (t) of 3, cost 2 (m) of 100, cost 3 (l) of 1
> Many salts:     11070 c/s real, 11145 c/s virtual
> Only one salt:  11299 c/s real, 11299 c/s virtual
> 
> none@...e ~/Desktop/rr/run $ ./john --test --format=argon2d-opencl
> Benchmarking: argon2d-opencl [Blake2 OpenCL]...
> memory per hash : 100.00 kB
> Device 0: GeForce GTX 960M
> using different password for benchmarking
> DONE
> Speed for cost 1 (t) of 3, cost 2 (m) of 100, cost 3 (l) of 1
> Many salts:     13884 c/s real, 13884 c/s virtual
> Only one salt:  13884 c/s real, 13768 c/s virtual

I wonder if faster GPUs, such as those in "super", will outperform a CPU
at this.  They should, as Argon2 isn't particularly GPU-resistant (except
for needing 1 KB of preferably local or private memory for the state).

> this is version of argon2 from github
> https://github.com/khovratovich/Argon2 and I have some questions and
> comments
> 
> here in version argon2i
> https://github.com/khovratovich/Argon2/blob/master/Argon2i/opt-sse/argon2i-opt-sse.cpp
> is text "Argon2d optimized implementation"
> 
> and "SSE3" but it's SSSE3 in blake2
> https://github.com/khovratovich/Argon2/blob/master/Argon2i/opt-sse/blake2b.cpp

SSE3 is generally useless for crypto (adds nothing that we'd use and
that wasn't already in SSE2).  So any mention of SSE3 in this context is
almost certainly a typo of SSSE3.

SSSE3 is useful in that it adds the PSHUFB instruction, available via
the _mm_shuffle_epi8() intrinsic.  BLAKE2, during its design, had its
rotate counts deliberately adjusted such that this instruction would be
usable to implement them.  (This change between BLAKE and BLAKE2 is
similar to an equivalent change between Salsa20 and ChaCha.)
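
To make the byte-shuffle point concrete, here is a small sketch (my own
illustration, not code from BLAKE2 or Argon2; all function names are
made up).  It checks that a 64-bit rotate right by 16, 24, or 32 bits is
just a permutation of bytes, and, when the build has SSSE3 enabled, that
PSHUFB with a suitable mask gives the same result for two 64-bit lanes
at once:

```c
#include <assert.h>
#include <stdint.h>
#ifdef __SSSE3__
#include <tmmintrin.h>
#endif

/* Reference rotate right; valid for 0 < c < 64. */
static uint64_t rotr64(uint64_t x, unsigned c)
{
    return (x >> c) | (x << (64 - c));
}

/* Rotate right by a whole number of bytes, done as a byte permutation:
 * result byte i (counting from the least significant byte) is source
 * byte (i + nbytes) % 8. */
static uint64_t rotr64_bytes(uint64_t x, unsigned nbytes)
{
    uint64_t r = 0;
    for (unsigned i = 0; i < 8; i++) {
        uint64_t b = (x >> (8 * ((i + nbytes) % 8))) & 0xff;
        r |= b << (8 * i);
    }
    return r;
}

#ifdef __SSSE3__
/* The same permutation for two 64-bit lanes at once, using PSHUFB.
 * Each mask byte selects the source byte for that result position. */
static void rotr64x2_by16(uint64_t v[2])
{
    const __m128i r16 = _mm_setr_epi8(2, 3, 4, 5, 6, 7, 0, 1,
                                      10, 11, 12, 13, 14, 15, 8, 9);
    __m128i x = _mm_loadu_si128((const __m128i *)v);
    _mm_storeu_si128((__m128i *)v, _mm_shuffle_epi8(x, r16));
}
#endif

/* Compare the byte-permutation forms against the plain rotate. */
static void rotation_demo(void)
{
    const uint64_t x = 0x0123456789abcdefULL;
    assert(rotr64_bytes(x, 2) == rotr64(x, 16));
    assert(rotr64_bytes(x, 3) == rotr64(x, 24));
    assert(rotr64_bytes(x, 4) == rotr64(x, 32));
#ifdef __SSSE3__
    uint64_t v[2] = { x, ~x };
    rotr64x2_by16(v);
    assert(v[0] == rotr64(x, 16) && v[1] == rotr64(~x, 16));
#endif
}
```

A rotate by 63 is not a whole number of bytes, so it cannot be done this
way and still needs shifts (or an add, as x + x is x << 1).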

In blake2b-round.h currently in jumbo, we actually see uses of SSSE3:

#ifndef __XOP__
#ifdef __SSSE3__
#define _mm_roti_epi64(x, c) \
    (-(c) == 32) ? _mm_shuffle_epi32((x), _MM_SHUFFLE(2,3,0,1))  \
    : (-(c) == 24) ? _mm_shuffle_epi8((x), r24) \
    : (-(c) == 16) ? _mm_shuffle_epi8((x), r16) \
    : (-(c) == 63) ? _mm_xor_si128(_mm_srli_epi64((x), -(c)), _mm_add_epi64((x), (x)))  \
    : _mm_xor_si128(_mm_srli_epi64((x), -(c)), _mm_slli_epi64((x), 64-(-(c))))

This comes straight from BLAKE2 designers' code.

Argon2's bundled BLAKE2 code is essentially the same:

https://github.com/khovratovich/Argon2/blob/master/Argon2i/opt-sse/blake2b-round.h

So yes, it uses XOP when available, and otherwise it uses SSSE3 when
that is available.  (XOP is superior to SSSE3 here, as it provides
native vector rotate instructions.)

> even SSE4_1 but I don't know if only one instruction
> blake2b-round.h:#define LOADU(p)  _mm_loadu_si128( (__m128i *)(p) )
> can make that it's SSE4_1 version (?)

As magnum pointed out, this instruction doesn't require SSE4.1 at all.

My guess is that they use the unaligned load instructions only when
SSE4.1 is enabled because actual CPUs with SSE4.1 or better tend to
offer those instructions "for free" (no performance overhead), whereas
on many older CPUs they do carry a performance overhead.  So it could be
a "just in case" thing - if S->buf somehow isn't naturally aligned, but
the build is for SSE4.1 or better, then it will work anyway.  We could
ask Samuel Neves to clarify this.
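
For what it's worth, here is a tiny sketch showing that
_mm_loadu_si128() needs nothing beyond SSE2 (it lives in emmintrin.h)
and works at any alignment.  This is only an illustration with made-up
function names, and it falls back to memcpy() when SSE2 isn't enabled:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#ifdef __SSE2__
#include <emmintrin.h>
#endif

/* Load 16 bytes from p (any alignment) and XOR the two 64-bit halves. */
static uint64_t demo_load16_fold(const unsigned char *p)
{
    uint64_t w[2];
#ifdef __SSE2__
    /* SSE2 unaligned load: no alignment requirement on p. */
    __m128i v = _mm_loadu_si128((const __m128i *)p);
    _mm_storeu_si128((__m128i *)w, v);
#else
    memcpy(w, p, sizeof(w));
#endif
    return w[0] ^ w[1];
}

/* Try 16 consecutive offsets; at most one of them is 16-byte aligned. */
static void demo_unaligned_check(void)
{
    unsigned char buf[32];
    for (unsigned i = 0; i < sizeof(buf); i++)
        buf[i] = (unsigned char)i;
    for (unsigned off = 0; off < 16; off++) {
        uint64_t w[2];
        memcpy(w, buf + off, sizeof(w));
        assert(demo_load16_fold(buf + off) == (w[0] ^ w[1]));
    }
}
```

This says nothing about speed on a given CPU, only that the intrinsic
is available and correct without SSE4.1.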

> files .cpp are with header and I added that I modified these files but
> files .h are without header and I don't know what to do with these,
> even part of blake2 is without header
> https://github.com/khovratovich/Argon2/blob/master/Argon2i/ref/blake-round.h

As magnum said, we should have the PHC formats use a shared BLAKE2
implementation already in our tree, unless/until there are reasons to
include custom BLAKE2 code with the formats.

... and that reason could be the 1.2 revision of Argon2 using BlaMka, a
modification of the BLAKE2 round.

If a source file is lacking a comment on who wrote it, you may add a
comment like that describing where you got the file, under what license,
and what you changed.  Preferably do that with a separate commit (first
commit the unmodified file).

Alexander