john-dev - Re: bf-opencl vectorization (was: bf-opencl fails self-test on CPU)


Message-ID: <CA+TsHUC2=v74xZUMjSE7VXhodjRfpjEf3_6N=1fJbNhRFV3reA@mail.gmail.com>
Date: Thu, 18 Oct 2012 21:49:29 +0530
From: Sayantan Datta <std2048@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: bf-opencl vectorization (was: bf-opencl fails
 self-test on CPU)

Hi,

On Thu, Oct 18, 2012 at 9:02 AM, Solar Designer <solar@...nwall.com> wrote:

> Sayantan -
>
> On Tue, Oct 16, 2012 at 11:31:15PM +0530, Sayantan Datta wrote:
> > On Mon, Aug 13, 2012 at 9:41 AM, Solar Designer <solar@...nwall.com> wrote:
> >
> > > BTW, is there any way to target future Intel CPUs (those with AVX2)
> > > with Intel's OpenCL SDK and see if this kernel would be vectorized then?
> > > Of course, we won't be able to run it yet, except maybe on their SDE.
> >
> > I was looking for opencl cpu optimizations targeting sse but couldn't get
> > a proper answer. So should I try vectorizing the bf kernel for cpu? If I'm
> > targeting sse then what should be the vector length?
>
> You mean if you're targeting the future AVX2, right?  There's no gather
> addressing in CPUs before AVX2 becomes available next year (or on Intel
> SDE now).  I'd expect that uint4 or uint8 would be right.  We'll be
> limited by 32 KB L1 data cache on the first CPUs to support AVX2, so
> this means at most 8 bcrypt instances per CPU core.  AVX2 vectors are
> 256-bit, which gives us 8 too.  I don't know if two uint4's interleaved
> (by the compiler) or one uint8 would be more efficient.  Also,
> auto-vectorization of your existing bf-opencl code might just work (when
> targeting AVX2).
>
> Anyhow, even if you implement and try this now, you'd at best be able to
> take a look at the generated code and test that it works in SDE (but is
> very slow, as expected for emulation).  You will have little idea of how
> fast it'd run on the actual CPUs.  So this is not worth a lot of your
> time now.  I was merely curious whether the code, as-is, would be
> getting auto-vectorized for AVX2 or not - in other words, whether we
> have something to try as soon as suitable CPUs become available or not.
>
> As to trying out explicit vectorization in bf-opencl on GPU, you may.
> I doubt that we'll see much or any performance increase from this, but
> feel free to try.
>
> Thanks,
>
> Alexander
>
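For reference, the "at most 8 instances" figure quoted above can be reproduced with a little arithmetic. The sketch below assumes the standard Blowfish state layout (an 18-word P-array plus four 256-entry S-boxes of 32-bit words) and the 32 KB L1 data cache mentioned in the quote. The struct and function names are made up for illustration and are not taken from bf-opencl.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout: standard Blowfish state, one per bcrypt instance. */
typedef struct {
    uint32_t P[18];      /* subkeys, 72 bytes */
    uint32_t S[4][256];  /* S-boxes, 4096 bytes */
} bf_state;

enum {
    L1D_BYTES   = 32 * 1024, /* L1 data cache size given in the quote */
    VECTOR_BITS = 256,       /* AVX2 register width */
    LANE_BITS   = 32         /* Blowfish works on 32-bit words */
};

/* Instances that fit if only the S-boxes have to stay in L1. */
static size_t instances_by_sboxes(void)
{
    return L1D_BYTES / sizeof(((bf_state *)0)->S);
}

/* Instances that fit if the P-arrays are counted as well. */
static size_t instances_by_full_state(void)
{
    return L1D_BYTES / sizeof(bf_state);
}

/* Number of 32-bit lanes in one AVX2 vector. */
static size_t lanes_per_vector(void)
{
    return VECTOR_BITS / LANE_BITS;
}

/* The S-box-only budget and the vector width agree, as the quote says. */
static void check_budget(void)
{
    assert(instances_by_sboxes() == lanes_per_vector());
}
```

Counting S-boxes alone gives 8, matching the 8 lanes of a 256-bit vector. Counting the P-arrays too brings the figure down to 7, which is consistent with "at most 8" being an upper bound.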

As a matter of fact, I did try doing two hashes per kernel. OpenCL seems to
auto-vectorize the code if the instructions are mixed at the BF_ENCRYPT()
level. However, if we mix at the BF_ROUND() level, the results are poor. I
could achieve nearly 2900 c/s on an Intel Core i5-2500 with OpenCL. For
comparison, the i5-2500 does 3600 c/s with native OpenMP at stock speeds.
However, there wasn't much improvement on the FX-8120. Maybe I could try
doing four hashes per kernel later. I already posted the patch to the git
repo yesterday.
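
To illustrate the two interleaving granularities mentioned above, here is a minimal sketch in plain C. It is only one possible reading of "mixing at the BF_ENCRYPT() level" versus "at the BF_ROUND() level", not the code from the patch. The names bf_ctx, bf_toy_init and the pair_* functions are hypothetical, and the tables are filled with arbitrary values rather than the real Blowfish constants or a real key schedule.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t P[18];
    uint32_t S[4][256];
} bf_ctx;

/* Hypothetical helper: arbitrary LCG output, NOT real Blowfish constants. */
static void bf_toy_init(bf_ctx *c, uint32_t seed)
{
    uint32_t x = seed;
    int i, j;
    for (i = 0; i < 18; i++) { x = x * 1664525u + 1013904223u; c->P[i] = x; }
    for (i = 0; i < 4; i++)
        for (j = 0; j < 256; j++) { x = x * 1664525u + 1013904223u; c->S[i][j] = x; }
}

/* Blowfish F function: four S-box lookups combined with add/xor/add. */
static uint32_t bf_f(const bf_ctx *c, uint32_t x)
{
    return ((c->S[0][x >> 24] + c->S[1][(x >> 16) & 0xff])
            ^ c->S[2][(x >> 8) & 0xff]) + c->S[3][x & 0xff];
}

/* One Feistel round on a single instance. */
#define BF_ROUND(c, L, R, n) \
    do { (R) ^= bf_f((c), (L)) ^ (c)->P[(n) + 1]; } while (0)

/* Full 16-round block encryption of one instance. */
static void bf_encrypt(const bf_ctx *c, uint32_t b[2])
{
    uint32_t L = b[0] ^ c->P[0], R = b[1];
    int n;
    for (n = 0; n < 16; n += 2) {
        BF_ROUND(c, L, R, n);
        BF_ROUND(c, R, L, n + 1);
    }
    b[0] = R ^ c->P[17];
    b[1] = L;
}

/* Coarse mixing: alternate whole block encryptions of the two instances. */
static void pair_mixed_per_encrypt(const bf_ctx *c0, const bf_ctx *c1,
                                   uint32_t b0[2], uint32_t b1[2])
{
    bf_encrypt(c0, b0);
    bf_encrypt(c1, b1);
}

/* Fine mixing: alternate individual rounds of the two instances. */
static void pair_mixed_per_round(const bf_ctx *c0, const bf_ctx *c1,
                                 uint32_t b0[2], uint32_t b1[2])
{
    uint32_t L0 = b0[0] ^ c0->P[0], R0 = b0[1];
    uint32_t L1 = b1[0] ^ c1->P[0], R1 = b1[1];
    int n;
    for (n = 0; n < 16; n += 2) {
        BF_ROUND(c0, L0, R0, n);
        BF_ROUND(c1, L1, R1, n);
        BF_ROUND(c0, R0, L0, n + 1);
        BF_ROUND(c1, R1, L1, n + 1);
    }
    b0[0] = R0 ^ c0->P[17]; b0[1] = L0;
    b1[0] = R1 ^ c1->P[17]; b1[1] = L1;
}

/* Both orderings must produce identical ciphertext; only scheduling differs. */
static void pair_selfcheck(const bf_ctx *c0, const bf_ctx *c1)
{
    uint32_t a0[2] = {1, 2}, a1[2] = {3, 4};
    uint32_t r0[2] = {1, 2}, r1[2] = {3, 4};
    pair_mixed_per_encrypt(c0, c1, a0, a1);
    pair_mixed_per_round(c0, c1, r0, r1);
    assert(a0[0] == r0[0] && a0[1] == r0[1]);
    assert(a1[0] == r1[0] && a1[1] == r1[1]);
}
```

Both variants compute the same result, so any speed difference comes from how the compiler schedules and vectorizes them. The sketch says nothing about which one is faster on a given device.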


Regards,
Sayantan
