john-dev - Re: interleaved bitslice?

Message-ID: <20150314220255.GA16464@openwall.com>
Date: Sun, 15 Mar 2015 01:02:55 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: interleaved bitslice?

On Sat, Mar 14, 2015 at 10:49:30PM +0100, magnum wrote:
> On 2015-03-14 20:01, Solar Designer wrote:
> > On Thu, Mar 12, 2015 at 08:37:01AM +0100, magnum wrote:
> >> On 2015-03-11 21:55, Solar Designer wrote:
> >>> solar@...l:~/md5slice$ gcc md5slice.c -o md5slice -Wall -s -O3 -fomit-frame-pointer -funroll-loops -DVECTOR -march=native
> >>>
> >>> This gave "warning: always_inline function might not be inlinable" about
> >>> FF(), I(), H(), F(), add32r(), add32c(), add32() - but then it built
> >>> fine.  The speed is:
> >>
> >> Solar,
> >>
> >> While experimenting with this I noticed using a vector size of 32 but
> >> still compiling for AVX gave a slight boost (~5%). I assume this ends up
> >> similar to the interleaving we use in Jumbo, and is faster for the same
> >> reasons.
> > 
> > I've just tested this with gcc 4.9.2 on Linux, and the generated code is
> > "floating-point" 256-bit AVX.  So this is not interleaving.
> 
> And yet it's faster? I would not have guessed that ever but I suppose
> it's not that special. I'm probably the guy least using floating point
> in the entire world.

As I mentioned, 128-bit and 256-bit AVX are roughly the same speed per
bit on current CPUs, at least for bitslice DES.  So your 5% speed
difference, in either direction, is realistic.
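
For illustration, here is a minimal sketch of what "a vector size of 32"
can look like with the GCC/Clang vector_size attribute.  It is not taken
from md5slice.c: the names vec256, md5_f() and md5_f_mismatches() are
hypothetical, and only a bitwise select of the kind MD5's F() uses is
shown.  AVX without AVX2 has 256-bit bitwise operations only in their
floating-point forms (vandps, vandnps, vorps, vxorps), which would be
consistent with the "floating-point" code observed above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical type: one 256-bit vector holding eight 32-bit lanes,
 * declared with the GCC/Clang vector_size attribute (size in bytes). */
typedef uint32_t vec256 __attribute__((vector_size(32)));

/* Bitwise select F(x, y, z) = (x & y) | (~x & z), applied to all 256 bits
 * at once.  The compiler decides which instructions implement it. */
static vec256 md5_f(vec256 x, vec256 y, vec256 z)
{
    return (x & y) | (~x & z);
}

/* Compare the vector result with plain scalar code, lane by lane.
 * Returns the number of lanes that differ (0 means they agree). */
int md5_f_mismatches(void)
{
    vec256 x = {0}, y = {0}, z = {0}, r;
    int i, bad = 0;

    assert(sizeof(vec256) == 32);

    /* Arbitrary test patterns, different in every lane. */
    for (i = 0; i < 8; i++) {
        x[i] = 0x01234567u * (uint32_t)(i + 1);
        y[i] = 0x89abcdefu ^ (uint32_t)(i << 4);
        z[i] = 0xfedcba98u + (uint32_t)i;
    }

    r = md5_f(x, y, z);

    for (i = 0; i < 8; i++) {
        uint32_t want = (x[i] & y[i]) | (~x[i] & z[i]);
        if (r[i] != want)
            bad++;
    }
    return bad;
}
```

To see which instructions a given compiler actually picks, one can
build this with "gcc -O3 -mavx -S" and read the assembly; the result
may differ between compiler versions.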

We should really try 256-bit AVX2 on the bitslice DES code.  Judging by
other benchmarks so far, it might be much faster on Haswell.

Alexander
