Date: Thu, 18 Oct 2012 21:49:29 +0530
From: Sayantan Datta <std2048@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: bf-opencl vectorization (was: bf-opencl fails self-test on CPU)

Hi,

On Thu, Oct 18, 2012 at 9:02 AM, Solar Designer <solar@...nwall.com> wrote:

> Sayantan -
>
> On Tue, Oct 16, 2012 at 11:31:15PM +0530, Sayantan Datta wrote:
> > On Mon, Aug 13, 2012 at 9:41 AM, Solar Designer <solar@...nwall.com> wrote:
> > > BTW, is there any way to target future Intel CPUs (those with AVX2)
> > > with Intel's OpenCL SDK and see if this kernel would be vectorized then?
> > > Of course, we won't be able to run it yet, except maybe on their SDE.
> >
> > I was looking for opencl cpu optimizations targeting sse but couldn't get
> > a proper answer. So should I try vectorizing the bf kernel for cpu? If I'm
> > targeting sse then what should be the vector length?
>
> You mean if you're targeting the future AVX2, right? There's no gather
> addressing in CPUs before AVX2 becomes available next year (or on Intel
> SDE now). I'd expect that uint4 or uint8 would be right. We'll be
> limited by 32 KB L1 data cache on the first CPUs to support AVX2, so
> this means at most 8 bcrypt instances per CPU core. AVX2 vectors are
> 256-bit, which gives us 8 too. I don't know if two uint4's interleaved
> (by the compiler) or one uint8 would be more efficient. Also,
> auto-vectorization of your existing bf-opencl code might just work (when
> targeting AVX2).
>
> Anyhow, even if you implement and try this now, you'd at best be able to
> take a look at the generated code and test that it works in SDE (but is
> very slow, as expected for emulation). You will have little idea of how
> fast it'd run on the actual CPUs. So this is not worth a lot of your
> time now. I was merely curious whether the code, as-is, would be
> getting auto-vectorized for AVX2 or not - in other words, whether we
> have something to try as soon as suitable CPUs become available or not.
> As to trying out explicit vectorization in bf-opencl on GPU, you may.
> I doubt that we'll see much or any performance increase from this, but
> feel free to try.
>
> Thanks,
>
> Alexander

As a matter of fact, I did try doing two hashes per kernel. OpenCL seems
to auto-vectorize the code if the instructions are mixed at BF_ENCRYPT()
level. However, if we mix at BF_ROUND() level, the results are poor. I
could achieve nearly 2900 c/s on an Intel Core i5 2500 with OpenCL. For
comparison, the i5 2500 does 3600 c/s with native OpenMP at stock speeds.
However, there wasn't much improvement on the FX-8120. Maybe I could try
doing four hashes per kernel later. I already posted the patch to the git
repo yesterday.

Regards,
Sayantan
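(The "at most 8 bcrypt instances per CPU core" figure from the quoted message can be checked with quick arithmetic. A minimal sketch in C, assuming each instance's working set is dominated by its four S-boxes of 256 32-bit entries; the helper name is hypothetical:)

```c
#include <stdint.h>

/* Sketch of the L1 cache budget: each bcrypt instance keeps four S-boxes
   of 256 32-bit entries each (the 18-entry P-array is negligible by
   comparison), so one instance occupies about 4 KB of L1 data cache. */
static unsigned bcrypt_instances_per_core(unsigned l1_data_bytes)
{
    unsigned sbox_bytes = 4u * 256u * (unsigned)sizeof(uint32_t); /* 4096 */
    return l1_data_bytes / sbox_bytes;
}
```

With a 32 KB L1 data cache this gives 8 instances, matching both the cache limit and the 8 lanes of a 256-bit uint8 vector.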
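(The "two hashes per kernel" experiment above amounts to advancing two independent instances in lockstep so the compiler can pack them into SIMD lanes. A toy sketch in plain C, not the real BF_ROUND; the round function here is hypothetical, with two table lookups standing in for Blowfish's four S-box lookups:)

```c
#include <stdint.h>

/* Hypothetical toy round: two S-box lookups and a mix, standing in for
   BF_ROUND.  Not the real Blowfish round function. */
static uint32_t toy_round(const uint32_t *sbox, uint32_t l, uint32_t r)
{
    return r ^ (sbox[l & 0xff] + sbox[(l >> 8) & 0xff]);
}

/* Two independent instances advanced in lockstep: each operation is
   issued once per instance before moving on, so a vectorizing compiler
   can pack the two streams into adjacent SIMD lanes.  Only the table
   lookups stay scalar on hardware without gather addressing. */
static void toy_round_x2(const uint32_t *sbox0, const uint32_t *sbox1,
                         uint32_t l0, uint32_t l1,
                         uint32_t *r0, uint32_t *r1)
{
    uint32_t t0 = sbox0[l0 & 0xff];
    uint32_t t1 = sbox1[l1 & 0xff];
    t0 += sbox0[(l0 >> 8) & 0xff];
    t1 += sbox1[(l1 >> 8) & 0xff];
    *r0 ^= t0;
    *r1 ^= t1;
}
```

Interleaving at this (coarser) level leaves each instance's data flow intact, which may be why mixing at BF_ENCRYPT() level vectorized well while mixing inside BF_ROUND() did not.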