john-dev - Re: SHA-1 H()

Message-ID: <20150908084725.GA10914@openwall.com>
Date: Tue, 8 Sep 2015 11:47:25 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: SHA-1 H()

Lei,

On Tue, Sep 08, 2015 at 03:04:57PM +0800, Lei Zhang wrote:
> On Sep 2, 2015, at 11:20 PM, Solar Designer <solar@...nwall.com> wrote:
> > 
> > Lei, will you test/benchmark on NEON and AltiVec once magnum commits the
> > fixes, please?
> 
> On AltiVec (4xOMP):

Are these 4 threads likely spread across different CPU cores?  That's no
good.  What we need for benchmarking is the maximum number of threads
supported in hardware on a certain number of CPU cores (1 core is OK if
you can't reliably use the entire machine's cores).  So on POWER8 I guess
you'll run 8 threads all locked to one physical CPU core.  You should be
able to do that with OpenMP env vars (affinity).

Please also run non-OpenMP benchmarks (thus, using 1 thread on 1 core
only) for reference.

> [before]
> pbkdf2-sha1:	35840 c/s real, 8982 c/s virtual
> pbkdf2-sha256:	14194 c/s real, 3566 c/s virtual
> pbkdf2-sha512:	5944 c/s real, 1489 c/s virtual
> 
> [after]
> pbkdf2-sha1:	36141 c/s real, 9057 c/s virtual
> pbkdf2-sha256:	14336 c/s real, 3592 c/s virtual
> pbkdf2-sha512:	5936 c/s real, 1498 c/s virtual

Thanks, but why are you testing these 3 hash types?  I think we made
relevant changes to SHA-1 (optimized H using vcmov() as discussed in
this thread), MD5 (ditto, using my newly found expression for I), and
MD4 (ditto, realizing that G is the same as SHA-2 Maj).

We also revised how vcmov() is emulated and what we do when it is
emulated, but this should not affect AltiVec and NEON because those have
non-emulated vcmov().  We also adjusted SHA-256's interleaving factor on
XOP, but that's just XOP.

There should be no change to SHA-256 and SHA-512 on AltiVec and NEON.

> On NEON (2xOMP):
> 
> [before]
> pbkdf2-sha1:	578 c/s real, 289 c/s virtual
> pbkdf2-sha256:	276 c/s real, 138 c/s virtual
> pbkdf2-sha512:	125 c/s real, 62.7 c/s virtual
> 
> [after]
> pbkdf2-sha1:	501 c/s real, 250 c/s virtual
> pbkdf2-sha256:	276 c/s real, 138 c/s virtual
> pbkdf2-sha512:	125 c/s real, 62.7 c/s virtual
> 
> There's no significant change on Altivec,

OK, but you need to run 8 threads/core benchmarks.

> while SHA1 somehow gets slower on NEON.

It might need a higher interleaving factor now.  You haven't even tried
introducing interleaving for these archs, have you?  (I don't recall.)

I think AltiVec probably won't need interleaving if we target modern
POWER chips with multiple hardware threads per core, but NEON will.

Also, as I suggested in the "MD5 on XOP, NEON, AltiVec" thread:

"[...] we'll need to revise MD5_I in simd-intrinsics.c to use [...]
the obvious expression with OR-NOT on NEON and AltiVec (IIRC, those
archs have OR-NOT, which might be lower latency than select)."

I think you should do that before benchmarking and before tuning of the
interleaving factors for MD5.

Thanks again,

Alexander