Date: Sun, 8 Jul 2012 12:34:51 +0530
From: Sayantan Datta <std2048@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: Rotate and bitselect investigation

On Sun, Jul 8, 2012 at 11:58 AM, Solar Designer <solar@...nwall.com> wrote:

> Sayantan,
>
> On Sun, Jul 08, 2012 at 10:51:54AM +0530, Sayantan Datta wrote:
> > I have investigated the rotate and bitselect issue on 7970.
>
> Thank you!
>
> > Both types of rotate (manual and built-in OpenCL function) use the
> > bitalign instruction.  I investigated using rotate(x,(uint)30) and
> > ((x << 30) | ((x) >> 2)). Also, the values loaded in the bitalign
> > instructions are exactly the same, except that they operate on different
> > registers. So you won't see any performance increase in this case.
>
> Sounds good.
>
> > However, with bitselect the situation is different. The built-in function
> > uses an alien bfi instruction
>
> This is precisely what we expected. :-)
>
> > which I couldn't find anywhere in the docs. The
> > manual version uses ixor and iand.
>
> So, any explanation why there's no measurable speedup (at least in my
> tests) from using bitselect() in SHA-1's F in MSCash2?  Is there some
> kind of stall, so that the reduction in instruction count doesn't help?
> Or is there somehow no such reduction (e.g., an extra move is added)?
>
> Alexander
>

Hi Alexander,

There was a small typo in the optimized kernel. Apparently you didn't
change from the manual version to bitselect in the SHA1_digest() function,
which is called in the 10K iterations.
Here are the new results with 1/4 of the KPC you used in the previous
benchmark.

std2048@...l:~/bin/run$ ./john -te -fo=mscash2-opencl
OpenCL platform 0: NVIDIA CUDA, 1 device(s).
Using device 0: GeForce GTX 570
Compilation log:
ptxas info    : Compiling entry function 'PBKDF2' for 'sm_20'
ptxas info    : Function properties for PBKDF2
 64 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 59 registers, 160+0 bytes smem, 52 bytes cmem[0], 4 bytes cmem[16]
Optimal Work Group Size:128
Kernel Execution Speed (Higher is better):0.484821
Benchmarking: M$ Cache Hash 2 (DCC2) PBKDF2-HMAC-SHA-1 [OpenCL]... DONE
Raw:    30481 c/s real, 30481 c/s virtual

std2048@...l:~/bin/run$ ./john -te -fo=mscash2-opencl -pla=1
OpenCL platform 1: AMD Accelerated Parallel Processing, 2 device(s).
Using device 0: Tahiti
Optimal Work Group Size:256
Kernel Execution Speed (Higher is better):1.549856
Benchmarking: M$ Cache Hash 2 (DCC2) PBKDF2-HMAC-SHA-1 [OpenCL]... DONE
Raw:    99801 c/s real, 99296 c/s virtual
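
For reference, the manual-to-bitselect change discussed above can be
sketched in plain C. This is not the actual kernel source: bitselect_emul()
is a hypothetical host-side stand-in for OpenCL's bitselect(), and
f_manual(), f_bitselect(), rotl_manual() and self_check() are illustrative
names only. It assumes the manual F is written in the common xor/and form,
which would match the ixor and iand seen in the ISA.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for OpenCL bitselect(a, b, c): each result bit comes from b
 * where the matching bit of c is 1, and from a where it is 0. */
static uint32_t bitselect_emul(uint32_t a, uint32_t b, uint32_t c)
{
    return (a & ~c) | (b & c);
}

/* SHA-1 F for rounds 0..19 in the "manual" xor/and form (assumed form). */
static uint32_t f_manual(uint32_t x, uint32_t y, uint32_t z)
{
    return z ^ (x & (y ^ z));
}

/* The same function as a single select: pick y where x is 1, else z. */
static uint32_t f_bitselect(uint32_t x, uint32_t y, uint32_t z)
{
    return bitselect_emul(z, y, x);
}

/* Manual 32-bit left rotate; n must be in 1..31 to avoid an undefined shift. */
static uint32_t rotl_manual(uint32_t x, unsigned n)
{
    return (x << n) | (x >> (32 - n));
}

/* Checks on a few sample words that both forms of F agree. */
static void self_check(void)
{
    const uint32_t w[] = { 0x67452301u, 0xEFCDAB89u, 0x98BADCFEu,
                           0x10325476u, 0xC3D2E1F0u };
    for (unsigned i = 0; i + 2 < sizeof(w) / sizeof(w[0]); i++)
        assert(f_manual(w[i], w[i + 1], w[i + 2]) ==
               f_bitselect(w[i], w[i + 1], w[i + 2]));
}
```

Calling self_check() from a small main() is enough to confirm the two forms
agree on the host. Whether the compiler maps the select to a single
instruction on a given device can only be confirmed from the ISA dump.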

Regards,
Sayantan
