john-dev - Re: bitslice MD*/SHA*, AVX2

Follow @Openwall on Twitter for new release announcements and other news

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20150311212102.GA20481@openwall.com>
Date: Thu, 12 Mar 2015 00:21:02 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: bitslice MD*/SHA*, AVX2

On Wed, Mar 11, 2015 at 11:55:09PM +0300, Solar Designer wrote:
> There's definitely some further room for optimization in md5slice.c -
> for example, it does not yet take advantage of the common subexpressions
> in the invocations of HH() (we can save 8*32 XORs).

BTW, it appears that we don't currently use this optimization.  Not even
in the copy of my portable MD5 implementation in JtR, which I've since
updated to include this optimization here:

http://openwall.info/wiki/people/solar/software/public-domain-source-code/md5

In my testing, this might not be beneficial on 2-operand archs such as
plain x86, but it should be on 3-operand archs such as AVX.  So we
should update the code in sse-intrinsics.c, and benchmark.  And we should
update the plain C code anyway, such as for non-x86 archs (which are
mostly 3-operand RISC).

magnum, Jim?

> FF() has everything
> merged into it manually and specialized, whereas GG(), HH(), and II()
> use other functions.  Also, explicit (such as via cpp or with a script)
> unrolling and inlining might (or might not) work better than gcc's (the
> various *ptr++'s will be replaced with explicit array accesses at fixed
> indices, which I think gcc does for us anyway).

BTW, the bitselect instructions on XOP and GCN, and the ternary logic
instructions on AVX-512 and Maxwell are usable for both straightforward
and bitslice implementations, but it might be tough to come up with
their optimal usage for the latter.  We get a large circuit that we need
to map onto those bitselect's or LUT-3's optimally.  In straightforward
implementations, we simply see which portions of the MD*/SHA* basic
functions fit bitselect's or LUT-3's.  With bitslice implementations, we
can do the same, but it is also possible that portions of logic coming
from ADDs or crossing boundaries between former bitwise ops and ADDs
(even across former bit rotates) fit bitselect's or LUT-3's.  In some
cases, there might be tradeoffs here - e.g., don't fully put a basic
MD*/SHA* function into a LUT-3, but instead put a portion of it
intermixed with nearby logic coming from ADDs into two LUT-3's.  Or even
add redundant extra logic to make the whole thing fit LUT-3's better.
In other words, a greedy algorithm putting everything into the most
capable instructions sequentially might not produce the best results.
I did not look into this closely yet, so this is pure speculation on
what optimization opportunities and challenges we might face.

Alexander

Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.