john-dev - Re: optimized mscash2-opencl

Message-ID: <CA+TsHUD2W_cVu7S6QwXYr7gDWQuuC7rjB6b_Q-XdJZ4X7DwhSA@mail.gmail.com>
Date: Sat, 7 Jul 2012 17:14:52 +0530
From: Sayantan Datta <std2048@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: optimized mscash2-opencl

On Sat, Jul 7, 2012 at 3:01 PM, Solar Designer <solar@...nwall.com> wrote:

> Sayantan, magnum -
>
> I was puzzled by the fact that changing the "manual" rotates to rotate()
> in pbkdf2_kernel.cl made it twice as slow on HD 7970 (at least).  Today I
> looked into this.  It turns out that Sayantan's version of the code
> heavily relied on the compiler doing some non-trivial optimizations,
> including figuring out that two of the four SHA-1 computations could be
> moved out of the 10k-iterations loop.  Somehow the attempted change to
> use rotate() was just enough to prevent that specific optimization.
>
> Anyway, I've optimized the code to avoid relying on the compiler doing
> this, and I made several other optimizations as well.  In the current
> pbkdf2_kernel.cl the uses of rotate() and bitselect() no longer result
> in any slowdown; however, they still don't result in any speedup
> either, which is puzzling.  Was the compiler good enough to generate the
> proper instructions anyway? or does it still not do that?  We need to
> examine the code to find out - at least IL if not native.  Sayantan -
> this is now a task for you.
>
> Before optimizations:
>
> OpenCL platform 1: AMD Accelerated Parallel Processing, 2 device(s).
> Using device 0: Tahiti
> Optimal Work Group Size:256
> Kernel Execution Speed (Higher is better):1.403122
> Benchmarking: M$ Cache Hash 2 (DCC2) PBKDF2-HMAC-SHA-1 [OpenCL]... DONE
> Raw:    92304 c/s real, 92467 c/s virtual
>
> OpenCL platform 0: NVIDIA CUDA, 1 device(s).
> Using device 0: GeForce GTX 570
> Optimal Work Group Size:512
> Kernel Execution Speed (Higher is better):0.416847
> Benchmarking: M$ Cache Hash 2 (DCC2) PBKDF2-HMAC-SHA-1 [OpenCL]... DONE
> Raw:    26900 c/s real, 26900 c/s virtual
>
> After:
>
> OpenCL platform 1: AMD Accelerated Parallel Processing, 2 device(s).
> Using device 0: Tahiti
> Optimal Work Group Size:256
> Kernel Execution Speed (Higher is better):1.492774
> Benchmarking: M$ Cache Hash 2 (DCC2) PBKDF2-HMAC-SHA-1 [OpenCL]... DONE
> Raw:    97814 c/s real, 97632 c/s virtual
>
> OpenCL platform 0: NVIDIA CUDA, 1 device(s).
> Using device 0: GeForce GTX 570
> Optimal Work Group Size:128
> Kernel Execution Speed (Higher is better):0.491235
> Benchmarking: M$ Cache Hash 2 (DCC2) PBKDF2-HMAC-SHA-1 [OpenCL]... DONE
> Raw:    31852 c/s real, 31813 c/s virtual
>
> This is +6% on AMD and +18% on NVIDIA.
>
> Actual run:
>
> $ ./john -i=alpha ~/john/contest-2011/hashes-all.txt-1.mscash2 -fo=mscash2-opencl -pla=1
> OpenCL platform 1: AMD Accelerated Parallel Processing, 2 device(s).
> Using device 0: Tahiti
> Optimal Work Group Size:128
> Kernel Execution Speed (Higher is better):1.492764
> Loaded 1152 password hashes with 1090 different salts (M$ Cache Hash 2 (DCC2) PBKDF2-HMAC-SHA-1 [OpenCL])
> guesses: 0  time: 0:00:00:12 0.00%  c/s: 21260  trying: bara - choedia
> guesses: 0  time: 0:00:00:15 0.00%  c/s: 52393  trying: bara - choedia
> salart           (gemignani)
> guesses: 1  time: 0:00:00:17 0.00%  c/s: 59208  trying: bara - choedia
> starter          (bevilaqua)
> guesses: 2  time: 0:00:00:23 0.00%  c/s: 68177  trying: bara - choedia
> guesses: 2  time: 0:00:01:43 0.00%  c/s: 96292  trying: bara - choedia
> moones           (alexino)
> guesses: 3  time: 0:00:02:02 0.00%  c/s: 96526  trying: bara - choedia
> guesses: 3  time: 0:00:03:14 0.00%  c/s: 99710  trying: bara - choedia
> assica           (bersamina)
> mingui           (abisheva)
> guesses: 5  time: 0:00:03:54 0.00%  c/s: 101588  trying: bara - choedia
> annico           (boediman)
> guesses: 6  time: 0:00:04:21 0.00%  c/s: 101194  trying: bara - choedia
> stephat          (bamigboye)
> guesses: 7  time: 0:00:04:56 0.00%  c/s: 100800  trying: bara - choedia
> storine          (arient)
> guesses: 8  time: 0:00:05:39 0.00%  c/s: 100352  trying: bara - choedia
> aritta           (chamieh)
> streles          (aquinde)
> monies           (bercasio)
> merrate          (figuera)
> meless           (fiander)
> starine          (clavier)
> stomara          (elhadidi)
> stronie          (elizan)
> shoria           (daveii)
> mistom           (bhuriwale)
> alamel           (deblasis)
> ashame           (bareis)
> arandy           (ghazalie)
> samali           (baubie)
> stronia          (binduhewa)
> metale           (bazier)
> mereko           (aleksi)
> guesses: 25  time: 0:00:12:01 0.00%  c/s: 101062  trying: bara - choedia
> stramos          (empabido)
> artico           (fallangie)
> ashona           (estacion)
> arishi           (elvina)
> sherie           (dilawer)
> andrin           (alawieh)
> guesses: 31  time: 0:00:17:06 0.00%  c/s: 101869  trying: bara - choedia
> artie            (heilemann)
> merens           (heinzmann)
> standan          (gilead)
> artal            (adrienne)
> anness           (beccaria)
> guesses: 36  time: 0:00:18:32 0.00%  c/s: 101776  trying: bara - choedia
> shomos           (basie)
> mandia           (artillery)
> annane           (azizieh)
> guesses: 39  time: 0:00:19:23 0.00%  c/s: 101600  trying: bara - choedia
> stepine          (hemmati)
> guesses: 40  time: 0:00:20:27 0.00%  c/s: 101404  trying: bara - choedia
> sarone           (bangie)
> ashoon           (abhulimen)
> storten          (akinremi)
> misamo           (gravelin)
> guesses: 44  time: 0:00:21:03 0.00%  c/s: 101483  trying: bara - choedia
> stepand          (egnario)
> guesses: 45  time: 0:00:21:48 0.00%  c/s: 101356  trying: bara - choedia
>
> Default Adapter - AMD Radeon HD 7900 Series
>                   Sensor 0: Temperature - 86.00 C
>
> Default Adapter - AMD Radeon HD 7900 Series
>                             Core (MHz)    Memory (MHz)
>            Current Clocks :    925           1375
>              Current Peak :    925           1375
>   Configurable Peak Range : [300-1125]     [150-1575]
>                  GPU load :    98%
>
> Alexander
>
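
For reference, the hoisting described in the quoted message can be modeled
roughly as follows. This is a plain Python sketch, not the pbkdf2_kernel.cl
code; the helper names (keyed_states, pbkdf2_sha1_block1) are made up for
illustration. It assumes the usual HMAC structure, where the ipad and opad
key blocks depend only on the key, so their two SHA-1 block computations
can be done once and the saved states reused in every iteration.

```python
import hashlib

BLOCK = 64  # SHA-1 block size in bytes


def keyed_states(key):
    """Absorb the ipad and opad key blocks once (hypothetical helper)."""
    if len(key) > BLOCK:
        key = hashlib.sha1(key).digest()
    key = key.ljust(BLOCK, b"\0")
    inner = hashlib.sha1(bytes(b ^ 0x36 for b in key))
    outer = hashlib.sha1(bytes(b ^ 0x5C for b in key))
    return inner, outer


def pbkdf2_sha1_block1(key, salt, iterations):
    """First 20-byte output block of PBKDF2-HMAC-SHA-1."""
    inner, outer = keyed_states(key)  # loop-invariant work, done once

    def prf(msg):
        i = inner.copy()  # resume from the saved ipad state
        i.update(msg)
        o = outer.copy()  # resume from the saved opad state
        o.update(i.digest())
        return o.digest()

    u = prf(salt + (1).to_bytes(4, "big"))
    t = int.from_bytes(u, "big")
    for _ in range(iterations - 1):
        u = prf(u)  # only the two message-dependent blocks remain per round
        t ^= int.from_bytes(u, "big")
    return t.to_bytes(20, "big")
```

Comparing the result with hashlib.pbkdf2_hmac("sha1", key, salt, iterations)
for the same inputs is a quick sanity check that the hoisted form computes
the same value.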

I guess I didn't look deeply enough into the code, which is why I didn't
move the two SHA-1 computations out of the 10k-iteration loop myself. BTW,
I was expecting a much bigger speedup, nearly double on the 7970, because
those two SHA-1 computations represented almost half of the total
computation. This could mean we are using a lot of global memory, which I
will check too. With this patch we are on par with hashcat on the 570 but
still lagging behind on the 7970. Regarding bitselect() and rotate(): they
didn't even work properly when I first applied them, so I reverted them.
Anyway, I'll look into the bitselect and rotate case.
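
As a starting point for that check, here is a small Python model of the
32-bit arithmetic involved. It is only a sketch: the function names are
made up for illustration, and the bitselect() mapping for the SHA-1 Ch and
Maj functions is the commonly used one, which I am assuming rather than
quoting from pbkdf2_kernel.cl. It says nothing about which instructions the
compiler emits; it only confirms that the "manual" and built-in forms
should give identical results.

```python
MASK = 0xFFFFFFFF  # keep everything in 32 bits, like a uint in the kernel


def rotl_manual(x, n):
    """Shift-and-or left rotation; what rotate(x, n) computes for a uint."""
    n &= 31
    return ((x << n) | (x >> (32 - n))) & MASK


def bitselect(a, b, c):
    """Model of OpenCL bitselect(): take bits of b where c is 1, else of a."""
    return ((a & ~c) | (b & c)) & MASK


def ch_manual(x, y, z):
    """SHA-1 'choose' function written with plain logic operations."""
    return (z ^ (x & (y ^ z))) & MASK


def maj_manual(x, y, z):
    """SHA-1 'majority' function written with plain logic operations."""
    return ((x & y) | (z & (x | y))) & MASK


def ch_bitselect(x, y, z):
    """Ch via bitselect: x picks between y (bit set) and z (bit clear)."""
    return bitselect(z, y, x)


def maj_bitselect(x, y, z):
    """Maj via bitselect: where x and z differ take y, otherwise take x."""
    return bitselect(x, y, z ^ x)
```

If a rewritten kernel produces wrong hashes, swapped bitselect() arguments
are worth ruling out first, since the argument order is easy to get wrong.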

Regards,
Sayantan.
