Date: Tue, 8 Sep 2015 11:47:25 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: SHA-1 H()

Lei,

On Tue, Sep 08, 2015 at 03:04:57PM +0800, Lei Zhang wrote:
> On Sep 2, 2015, at 11:20 PM, Solar Designer <solar@...nwall.com> wrote:
> > 
> > Lei, will you test/benchmark on NEON and AltiVec once magnum commits the
> > fixes, please?
> 
> On AltiVec (4xOMP):

Are these 4 threads likely running on different CPU cores?  That's no good.
What we need for benchmarking is the maximum number of threads supported
in hardware on a certain number of CPU cores (on 1 core is OK if you
can't reliably use the entire machine's cores).  So on POWER8 I guess
you'll run 8 threads all locked to one physical CPU core.  You should be
able to do that with OpenMP env vars (affinity).

Please also run non-OpenMP benchmarks (thus, using 1 thread on 1 core
only) for reference.

> [before]
> pbkdf2-sha1:	35840 c/s real, 8982 c/s virtual
> pbkdf2-sha256:	14194 c/s real, 3566 c/s virtual
> pbkdf2-sha512:	5944 c/s real, 1489 c/s virtual
> 
> [after]
> pbkdf2-sha1:	36141 c/s real, 9057 c/s virtual
> pbkdf2-sha256:	14336 c/s real, 3592 c/s virtual
> pbkdf2-sha512:	5936 c/s real, 1498 c/s virtual

Thanks, but why are you testing these 3 hash types?  I think we made
relevant changes to SHA-1 (optimized H using vcmov() as discussed in
this thread), MD5 (ditto, using my newly found expression for I), and
MD4 (ditto, realizing that G is the same as SHA-2 Maj).
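
To make the three changes above concrete, here is a minimal scalar sketch.
It assumes a hypothetical helper sel(a, b, mask) standing in for a
select/vcmov-style operation (bits of a where mask is 1, bits of b where
mask is 0).  The function names and the exact select-based expressions
are illustrative assumptions, not necessarily what simd-intrinsics.c
uses; the reference definitions are the textbook ones.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar stand-in for a SIMD select: takes bits of a where
 * mask is 1 and bits of b where mask is 0. */
static uint32_t sel(uint32_t a, uint32_t b, uint32_t mask)
{
    return (a & mask) | (b & ~mask);
}

/* Textbook majority: SHA-1's function for rounds 40..59, MD4's G,
 * and SHA-2's Maj are all this same function. */
static uint32_t maj_ref(uint32_t x, uint32_t y, uint32_t z)
{
    return (x & y) | (x & z) | (y & z);
}

/* Textbook MD5 I. */
static uint32_t md5_i_ref(uint32_t x, uint32_t y, uint32_t z)
{
    return y ^ (x | ~z);
}

/* Majority via one XOR and one select: where x and y differ, z casts
 * the deciding vote; where they agree, the result is x. */
static uint32_t maj_sel(uint32_t x, uint32_t y, uint32_t z)
{
    return sel(z, x, x ^ y);
}

/* MD5 I via one select and one XOR:
 * sel(x, ~0, z) == (x & z) | ~z == x | ~z. */
static uint32_t md5_i_sel(uint32_t x, uint32_t y, uint32_t z)
{
    return y ^ sel(x, 0xffffffffu, z);
}

/* These are bitwise functions, so inputs whose bits cover all 8
 * combinations of (x, y, z) check every case.  Returns 1 on success. */
static int check_sel_forms(void)
{
    const uint32_t x = 0xf0f0f0f0u, y = 0xccccccccu, z = 0xaaaaaaaau;

    assert(maj_sel(x, y, z) == maj_ref(x, y, z));
    assert(md5_i_sel(x, y, z) == md5_i_ref(x, y, z));
    return 1;
}
```

Both select-based forms take two operations where the select is native,
versus four or more for the textbook majority.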

We also revised how vcmov() is emulated and what we do when it is
emulated, but this should not affect AltiVec and NEON because those have
non-emulated vcmov().  We also adjusted SHA-256's interleaving factor on
XOP, but that's just XOP.
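
For readers unfamiliar with what "emulated" means here: on archs without
a native select instruction, the operation has to be built from plain
bitwise ops.  The sketch below shows two standard ways to do that in
scalar C.  The function names are hypothetical, and this is not a claim
about which form the tree uses.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical emulation 1: AND, AND-NOT, OR.
 * Takes bits of a where mask is 1, bits of b where mask is 0. */
static uint32_t sel_andnot(uint32_t a, uint32_t b, uint32_t mask)
{
    return (a & mask) | (b & ~mask);
}

/* Hypothetical emulation 2: XOR, AND, XOR.
 * Where mask is 1 this yields b ^ a ^ b == a; where it is 0, b. */
static uint32_t sel_xor(uint32_t a, uint32_t b, uint32_t mask)
{
    return b ^ (mask & (a ^ b));
}

/* Bit patterns covering all 8 combinations of (a, b, mask).
 * Returns 1 on success. */
static int check_sel_emulations(void)
{
    const uint32_t a = 0xf0f0f0f0u, b = 0xccccccccu, m = 0xaaaaaaaau;

    assert(sel_andnot(a, b, m) == sel_xor(a, b, m));
    return 1;
}
```

Either emulation costs about three operations, so an expression written
around a select may be no faster than the textbook one when the select
is emulated.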

There should be no change to SHA-256 and SHA-512 on AltiVec and NEON.

> On NEON (2xOMP):
> 
> [before]
> pbkdf2-sha1:	578 c/s real, 289 c/s virtual
> pbkdf2-sha256:	276 c/s real, 138 c/s virtual
> pbkdf2-sha512:	125 c/s real, 62.7 c/s virtual
> 
> [after]
> pbkdf2-sha1:	501 c/s real, 250 c/s virtual
> pbkdf2-sha256:	276 c/s real, 138 c/s virtual
> pbkdf2-sha512:	125 c/s real, 62.7 c/s virtual
> 
> There's no significant change on Altivec,

OK, but you need to run benchmarks with 8 threads per core.

> while SHA1 somehow gets slower on NEON.

It might need a higher interleaving factor now.  You haven't even tried
introducing interleaving for these archs, have you?  (I don't recall.)

I think AltiVec probably won't need interleaving if we target modern
POWER chips with multiple hardware threads per core, but NEON will.
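
As a rough illustration of what an interleaving factor buys, the sketch
below advances several independent states in the same loop iteration, so
that consecutive operations do not depend on each other and a single
hardware thread can overlap their latencies.  INTERLEAVE, step() and the
toy round function are all hypothetical; real code would do this with
SIMD vectors and the actual hash rounds.

```c
#include <stdint.h>

#define INTERLEAVE 3 /* hypothetical interleaving factor */

static uint32_t rotl32(uint32_t v, int s)
{
    return (v << s) | (v >> (32 - s));
}

/* Toy dependent step: each call needs the previous result for the same
 * state, which is what creates the latency chain. */
static uint32_t step(uint32_t a, uint32_t k)
{
    return rotl32(a + k, 5) ^ a;
}

/* No interleaving: one long dependency chain. */
static void run_one(uint32_t *s, int rounds)
{
    for (int r = 0; r < rounds; r++)
        *s = step(*s, (uint32_t)r);
}

/* Interleaved: within each round the INTERLEAVE updates are mutually
 * independent, so the CPU is free to overlap them. */
static void run_interleaved(uint32_t s[INTERLEAVE], int rounds)
{
    for (int r = 0; r < rounds; r++)
        for (int i = 0; i < INTERLEAVE; i++)
            s[i] = step(s[i], (uint32_t)r);
}
```

With many hardware threads per core, the other threads can fill those
latency gaps instead, which is why interleaving may matter less there.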

Also, as I suggested in the "MD5 on XOP, NEON, AltiVec" thread:

"[...] we'll need to revise MD5_I in simd-intrinsics.c to use [...]
the obvious expression with OR-NOT on NEON and AltiVec (IIRC, those
archs have OR-NOT, which might be lower latency than select)."
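
For reference, the "obvious expression" is just the textbook MD5 I with
the OR and NOT fused into a single OR-NOT.  The scalar sketch below
compares it with a select-based form; orn() and sel() are hypothetical
stand-ins for the corresponding vector operations (on NEON, for example,
vornq_u32(a, b) computes a | ~b).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar stand-in for a vector OR-NOT: a | ~b. */
static uint32_t orn(uint32_t a, uint32_t b)
{
    return a | ~b;
}

/* Hypothetical scalar stand-in for a vector select. */
static uint32_t sel(uint32_t a, uint32_t b, uint32_t mask)
{
    return (a & mask) | (b & ~mask);
}

/* MD5 I as one OR-NOT plus one XOR. */
static uint32_t md5_i_orn(uint32_t x, uint32_t y, uint32_t z)
{
    return y ^ orn(x, z);
}

/* MD5 I as one select plus one XOR, for comparison. */
static uint32_t md5_i_sel(uint32_t x, uint32_t y, uint32_t z)
{
    return y ^ sel(x, 0xffffffffu, z);
}

/* Bit patterns covering all 8 combinations of (x, y, z).
 * Returns 1 on success. */
static int check_md5_i_forms(void)
{
    const uint32_t x = 0xf0f0f0f0u, y = 0xccccccccu, z = 0xaaaaaaaau;

    assert(md5_i_orn(x, y, z) == md5_i_sel(x, y, z));
    return 1;
}
```

Both forms are two operations, so the choice between them comes down to
the relative latency of OR-NOT and select on the target arch.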

I think you should do that before benchmarking and before tuning the
interleaving factors for MD5.

Thanks again,

Alexander
