Date: Sun, 15 Mar 2015 01:02:55 +0300
From: Solar Designer <>
Subject: Re: interleaved bitslice?

On Sat, Mar 14, 2015 at 10:49:30PM +0100, magnum wrote:
> On 2015-03-14 20:01, Solar Designer wrote:
> > On Thu, Mar 12, 2015 at 08:37:01AM +0100, magnum wrote:
> >> On 2015-03-11 21:55, Solar Designer wrote:
> >>> solar@...l:~/md5slice$ gcc md5slice.c -o md5slice -Wall -s -O3 -fomit-frame-pointer -funroll-loops -DVECTOR -march=native
> >>>
> >>> This gave "warning: always_inline function might not be inlinable" about
> >>> FF(), I(), H(), F(), add32r(), add32c(), add32() - but then it built
> >>> fine.  The speed is:
> >>
> >> Solar,
> >>
> >> While experimenting with this I noticed using a vector size of 32 but
> >> still compiling for AVX gave a slight boost (~5%). I assume this ends up
> >> similar to the interleaving we use in Jumbo, and is faster for the same
> >> reasons.
> > 
> > I've just tested this with gcc 4.9.2 on Linux, and the generated code is
> > "floating-point" 256-bit AVX.  So this is not interleaving.
> And yet it's faster? I would not have guessed that ever but I suppose
> it's not that special. I'm probably the guy least using floating point
> in the entire world.

As I mentioned, 128-bit and 256-bit AVX run at roughly the same speed
per bit on current CPUs, at least for bitslice DES.  So your 5% speed
difference, in either direction, is realistic.
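
For illustration, here is a minimal sketch of the kind of code being discussed: a 32-byte vector type declared with the GCC vector extension, and MD5's round-1 F function applied to it.  This is not taken from md5slice.c; the names vec256, md5_f_vec, md5_f_scalar and f_lanes_match are hypothetical.  When built with -mavx but without -mavx2, the only 256-bit bitwise instructions available are the floating-point ones (vandps, vandnps, vorps, vxorps), so that is what the compiler can be expected to use for the logical operations; the exact output depends on the compiler version and can be checked with "gcc -O2 -mavx -S".

```c
#include <stdint.h>

/* GCC/Clang vector extension: eight 32-bit lanes, 32 bytes (256 bits). */
typedef uint32_t vec256 __attribute__((vector_size(32)));

/* MD5 round-1 function F(x, y, z) = (x & y) | (~x & z), on whole vectors. */
static vec256 md5_f_vec(vec256 x, vec256 y, vec256 z)
{
	return (x & y) | (~x & z);
}

/* Scalar reference for one 32-bit lane. */
static uint32_t md5_f_scalar(uint32_t x, uint32_t y, uint32_t z)
{
	return (x & y) | (~x & z);
}

/* Returns 1 if every vector lane matches the scalar result, else 0. */
int f_lanes_match(void)
{
	vec256 x, y, z, r;
	int i;

	/* Arbitrary test inputs, different in each lane. */
	for (i = 0; i < 8; i++) {
		x[i] = 0x01234567u * (uint32_t)(i + 1);
		y[i] = 0x89abcdefu ^ (uint32_t)i;
		z[i] = 0xfedcba98u + (uint32_t)i;
	}

	r = md5_f_vec(x, y, z);

	for (i = 0; i < 8; i++)
		if (r[i] != md5_f_scalar(x[i], y[i], z[i]))
			return 0;
	return 1;
}
```

Changing vector_size(32) to vector_size(16) (and the lane count to 4) gives the 128-bit variant for comparison.  The bitwise results are the same either way; only the instruction selection and the resulting speed differ.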

We should really try 256-bit AVX2 on the bitslice DES code.  It might be
way faster on Haswell, based on other benchmarks so far.

