Date: Thu, 3 Sep 2015 21:40:03 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: SHA-1 H()

On Thu, Sep 03, 2015 at 11:52:47AM +0200, magnum wrote:
> On 2015-09-03 06:56, Solar Designer wrote:
> >On Wed, Sep 02, 2015 at 09:31:34PM +0200, magnum wrote:
> >>#define Ch(x, y, z) (z ^ (x & (y ^ z)))
> >>#define Ch(x, y, z) ((x & y) ^ ( (~x) & z))
> >>
> >>This is 3 vs. 4 ops, right?
> >
> >On archs without AND-NOT, yes.  So it's a good find, and I'm happy you
> >patched these.
> >
> >However, on archs with AND-NOT either is 3 ops, and the one with AND-NOT
> >has some parallelism.
> 
> Maybe the and-not one is better on some GPU then? I need to test. 

Yes, that's possible.
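
For reference, here is a minimal scalar sketch of the two Ch() forms under discussion. The names CH_XOR, CH_ANDNOT and ch_forms_agree are made up for this illustration and are not in the tree, and the real code works on SIMD vectors or OpenCL types rather than plain uint32_t. The test words are chosen so that their bit columns cover all 8 possible (x, y, z) inputs.

```c
#include <stdint.h>

/* 3 ops on any arch: XOR, AND, XOR, but each op depends on the previous one. */
#define CH_XOR(x, y, z)    ((z) ^ ((x) & ((y) ^ (z))))

/* 4 ops without AND-NOT (AND, NOT, AND, XOR); 3 ops with AND-NOT,
 * and the two AND-type ops do not depend on each other. */
#define CH_ANDNOT(x, y, z) (((x) & (y)) ^ (~(x) & (z)))

/* Hypothetical helper: returns 1 if both forms give the same result.
 * The bit columns of x, y, z enumerate all 8 input combinations. */
static int ch_forms_agree(void)
{
    const uint32_t x = 0xF0F0F0F0u, y = 0xCCCCCCCCu, z = 0xAAAAAAAAu;
    return CH_XOR(x, y, z) == CH_ANDNOT(x, y, z);
}
```

Both forms compute "x ? y : z" per bit, so the choice between them is purely about op count and dependency chains on a given target.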

> Apparently GCN has ANDN and NAND.

I need to take a fresh look at the arch manual, but in the generated
code I only see scalar ANDN, and never vector ANDN (nor NAND).  They
defined scalar ANDN presumably because it's so useful for exec masks.

I see you've committed this:

+#if cpu(DEVICE_INFO) || amd_gcn(DEVICE_INFO)
+#define HAVE_ANDNOT 1
+#endif

but I think the check for amd_gcn(DEVICE_INFO) is wrong.

And why this change? -

-#if !gpu_nvidia(DEVICE_INFO) || nvidia_sm_5x(DEVICE_INFO)
+#if !gpu_nvidia(DEVICE_INFO)
 #define USE_BITSELECT 1
 #elif gpu_nvidia(DEVICE_INFO)
 #define OLD_NVIDIA 1
 #endif

> >Maybe both forms of emulation need to be kept in pseudo_intrinsics.h
> >with a way for us to choose one or the other.  It might happen that the
> >optimal choice will vary by arch, CPU, compiler, format.
> 
> But if it varies by format, we need to decide outside pseudo_intrinsics.h.

We could include several versions of the macro in pseudo_intrinsics.h
and decide in the format via setting another macro (WANT_XXX) before
including pseudo_intrinsics.h.
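
Roughly like this, as a single-file scalar sketch. WANT_CH_ANDNOT and ch32 are hypothetical names used only for illustration (the actual macro name is still the XXX above), and the real header would define the vector versions rather than this scalar Ch().

```c
#include <stdint.h>

/* Format side: set this before including the shared header to pick
 * the AND-NOT variant. WANT_CH_ANDNOT is a placeholder name. */
#define WANT_CH_ANDNOT 1

/* Shared header side (what pseudo_intrinsics.h might do, shown inline
 * here so the sketch compiles on its own). */
#if WANT_CH_ANDNOT
#define Ch(x, y, z) (((x) & (y)) ^ (~(x) & (z)))
#else
#define Ch(x, y, z) ((z) ^ ((x) & ((y) ^ (z))))
#endif

/* Hypothetical format code using whichever variant was selected. */
static uint32_t ch32(uint32_t x, uint32_t y, uint32_t z)
{
    return Ch(x, y, z);
}
```

A format that leaves the macro undefined gets the default form, since an undefined identifier evaluates to 0 in #if.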

> BTW early tests indicate that 5916a57 made SHA-512 very slightly worse 
> (but almost hidden by normal variations).

On what hardware?

Whether the parallelism vs. register pressure tradeoff is beneficial is
in fact non-obvious.  But on XOP there should be a speedup from doing
1 op fewer.

Alexander
