john-dev - Re: PHC: Lyra2 vs yescrypt benchmarks 2

Message-ID: <20150726162531.GA2368@openwall.com>
Date: Sun, 26 Jul 2015 18:25:31 +0200
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: PHC: Lyra2 vs yescrypt benchmarks 2

On Sun, Jul 26, 2015 at 05:27:29PM +0200, Agnieszka Bielec wrote:
> 2015-07-26 16:11 GMT+02:00 Solar Designer <solar@...nwall.com>:
> > On Sun, Jul 26, 2015 at 04:02:19PM +0200, Agnieszka Bielec wrote:
> >> my understanding was that when intrinsics are in SSE2 but AVX is
> >> available they are compiled to AVX,
> >
> > Yes, although that's only part of the story.
> 
> sometimes compilers can mix AVX and SSE2 (I am not sure)

(At least) Intel CPUs impose a huge penalty on this (when operating on
the same XMM registers), so compilers are supposed to try and avoid it.

> >> and similarly SSE4 -> AVX2
> >
> > No.  This indicates that you don't understand what SSE4 is, and what
> > AVX2 is.  Please read up on them, then answer my question again.
> 
> I read some stuff,
> 
> SSE4 adds some instructions to SSE and, like SSE2, makes
> computations on 128-bit registers MMX (4 32-bit integers)

Yes, except those registers are called XMM, not MMX.  MMX is an older
SIMD instruction set, pre-dating SSE, which uses 64-bit vectors (and the
corresponding MMX registers).  So the name XMM was a pun on the older MMX.

> AVX - 256-bit registers YMM0 -YMM15 (8 x 32-bits or 4 x 64-bits or
> even lower) , some on registers MMX, math operations in form c = a + b
> (in sse a=a+b)

Yes.  Most important for us is that AVX introduced 3-operand
instructions (VEX encoding) as you mention, allowing one to specify a
separate destination register.  This eliminates the need for most MOV
instructions in typical code.

Obviously, the direct AVX equivalents of SSE* instructions - the kind of
equivalents that a compiler may substitute for the same intrinsics -
operate on 128-bit registers, and at x86 assembly code level they are
still called XMM and are logically the same as the XMM registers we're
used to with SSE* (but may be in a separate register file physically,
which is why there is the huge penalty for mixing SSE*/AVX insns operating on
"the same" XMM registers).  In this way, AVX works as an extension of
all of SSE* instruction sets to 3-operand form.
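
To illustrate the point about the same intrinsics compiling to either
encoding, here is a minimal sketch (not code from this thread).  The
function name add4x32 is hypothetical, and an x86-64 target with GCC or
Clang is assumed.  The source uses only an SSE2 intrinsic; which
encoding the compiler emits depends on the build flags.

```c
#include <emmintrin.h>	/* SSE2 intrinsics */
#include <stdint.h>

/*
 * Add four 32-bit integers at once: out[i] = a[i] + b[i] for i = 0..3.
 * add4x32 is a hypothetical name used only for this illustration.
 */
void add4x32(const uint32_t *a, const uint32_t *b, uint32_t *out)
{
	/* Unaligned 128-bit loads into XMM-sized values */
	__m128i va = _mm_loadu_si128((const __m128i *)a);
	__m128i vb = _mm_loadu_si128((const __m128i *)b);

	/*
	 * Built for SSE2, this is typically a 2-operand PADDD, where the
	 * destination register is also one of the sources (a = a + b).
	 * Built with AVX enabled, it is typically the VEX-encoded VPADDD
	 * with a separate destination (c = a + b), still on XMM registers.
	 */
	__m128i sum = _mm_add_epi32(va, vb);

	_mm_storeu_si128((__m128i *)out, sum);
}
```

Compiling this with "gcc -O2 -S -msse2" and again with "gcc -O2 -S -mavx"
and comparing the two assembly outputs should show the difference.  The
exact instructions and register allocation depend on the compiler and
its version.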

The 256-bit AVX instructions (and registers) appear to be useless for
our purposes so far, because they are mostly floating-point only, and
the few bitwise ones are slow.

> AVX2 - more instructions in 256-bit registers

AVX2 makes the 256-bitness actually useful to us, by providing integer
instructions and fast bitwise instructions.

When I ask whether some implementation uses AVX2, I mean whether it uses
those 256-bit integer and/or bitwise operations.  Any code that is 128-bit
only isn't really AVX2 even if you had AVX2 allowed during the build
(the compiler might use some AVX2 specifics here and there, but this
typically won't noticeably affect performance until the source code has
256-bit vectors).
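
As a rough illustration of that distinction (a sketch under stated
assumptions, not code from Lyra2 or yescrypt), the 256-bit path below
is what I would call actually using AVX2, while the 128-bit path is not,
regardless of build flags.  All function names are hypothetical.  The
target attribute and __builtin_cpu_supports() are GCC/Clang extensions,
and an x86-64 target is assumed.

```c
#include <immintrin.h>	/* SSE2 and AVX2 intrinsics */
#include <stdint.h>

/*
 * 256-bit path: eight 32-bit additions in one integer instruction on
 * YMM-sized vectors.  The target attribute enables AVX2 for this one
 * function, so the file builds without -mavx2.
 */
__attribute__((target("avx2")))
static void add8x32_avx2(const uint32_t *a, const uint32_t *b, uint32_t *out)
{
	__m256i va = _mm256_loadu_si256((const __m256i *)a);
	__m256i vb = _mm256_loadu_si256((const __m256i *)b);
	_mm256_storeu_si256((__m256i *)out, _mm256_add_epi32(va, vb));
}

/* 128-bit path: the same work done as two 4-element SSE2 additions */
static void add8x32_sse2(const uint32_t *a, const uint32_t *b, uint32_t *out)
{
	int i;

	for (i = 0; i < 8; i += 4) {
		__m128i va = _mm_loadu_si128((const __m128i *)(a + i));
		__m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
		_mm_storeu_si128((__m128i *)(out + i), _mm_add_epi32(va, vb));
	}
}

/* Pick the 256-bit path only when the CPU reports AVX2 at run time */
void add8x32(const uint32_t *a, const uint32_t *b, uint32_t *out)
{
	if (__builtin_cpu_supports("avx2"))
		add8x32_avx2(a, b, out);
	else
		add8x32_sse2(a, b, out);
}
```

Both paths produce the same results; only the vector width in the
source differs.  Whether the wider path is faster in a real hash
depends on the rest of the algorithm, as the yescrypt case below shows.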

For Lyra2, its designers reported getting something like a 30% speedup
with AVX2.  I don't know if that code is released.  Perhaps they simply
substituted a separately released implementation of BLAKE2 using AVX2.

For yescrypt, I tried using AVX2, but with the current pwxform defaults
the speedup on well was only ~6% and only for some of the cost settings.
That's because the S-box lookups remain 128-bit.  A much larger speedup is
possible with different pwxform settings (for wider S-boxes), which I
also experimented with on well, and which I may ask you to try
implementing later.  (I am interested in how those would behave on GPUs.)

> > And you really ought to be reading assembly output to get a feeling of
> > this stuff.
> 
> I will spend some time on it today and tomorrow

OK.

I suggest that you try building php_mt_seed for SSE4.1, AVX, and AVX2 -
three separate builds - and examine how the code differs.

Unfortunately, it does not support SIMD at all when building for an
instruction set inferior to SSE4.1.  I have an unreleased version that
also has code for SSE2 - perhaps I should complete and release it soon,
at least to enable educational use like this.

Alexander