john-dev - Re: Proposed optimizations to pwsafe

Message-ID: <CAJpaVcQ0f9NNMmDT2M7r0Kc1TzRMYC-OFmdywZOR_QvOcYpPsA@mail.gmail.com>
Date: Wed, 30 Jan 2013 05:46:01 -0500
From: Brian Wallace <nightstrike9809@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: Proposed optimizations to pwsafe

I cleaned up the OpenCL and CUDA code and added some further optimizations.
The optimizations were not giving the performance I wanted, so I started
converting the code to SIMD.  So far the SIMD version runs at about 1/6 the
speed of the latest optimized code, and I'm not really sure why.  If anyone
can point out any obvious bottlenecks, please let me know.

Here are the benchmarks I grabbed from the bleeding-jumbo branch of my fork.

Device 1: Tahiti (AMD Radeon HD 7900 Series)
Local worksize (LWS) 64, Global worksize (GWS) 57344
Benchmarking: Password Safe SHA-256 [OpenCL]... DONE
Raw: 472615 c/s real, 17203K c/s virtual

Benchmarking: Password Safe SHA-256 [CUDA]... DONE
Raw: 129590 c/s real, 128862 c/s virtual

I'll do a pull request before I start committing the SIMD code to my fork.

On Mon, Jan 28, 2013 at 4:26 PM, Brian Wallace <nightstrike9809@...il.com> wrote:

> I'm going to try and replace ror with rotate calls, but it seems to
> require some type conversions.  I'm doing a bit of reading up on OpenCL dev
> to fix any issues and hopefully get more c/s.
>
>
> On Mon, Jan 28, 2013 at 1:55 PM, magnum <john.magnum@...hmail.com> wrote:
>
>> Brian,
>>
>> After your OpenCL patch I get these warnings from pwsafe-opencl:
>>
>> Build log: <program source>:282:36: warning: signed shift result
>> (0x200000000) requires 35 bits to represent, but 'int' only has 32 bits
>>                 w[14] = sigma1( w[12] ) + w[7] + sigma0( 256 );
>>                                                  ^~~~~~~~~~~~~
>> <program source>:21:21: note: expanded from macro 'sigma0'
>> #define sigma0(x) ((ror(x,7))  ^ (ror(x,18)) ^ (x>>3))
>>                     ^
>> <program source>:16:33: note: expanded from macro 'ror'
>> #define ror(x,n) ((x >> n) | (x << (32-n)))
>>                               ~ ^  ~
>> <program source>:615:35: warning: signed shift result (0x200000000)
>> requires 35 bits to represent, but 'int' only has 32 bits
>>         w[14] = sigma1( w[12] ) + w[7] + sigma0( 256 );
>>                                          ^~~~~~~~~~~~~
>> <program source>:21:21: note: expanded from macro 'sigma0'
>> #define sigma0(x) ((ror(x,7))  ^ (ror(x,18)) ^ (x>>3))
>>                     ^
>> <program source>:16:33: note: expanded from macro 'ror'
>> #define ror(x,n) ((x >> n) | (x << (32-n)))
>>                               ~ ^  ~
>>
>>
>> It passes self-test though. Even the Test Suite passes IIRC. So maybe
>> this is harmless? But we should still get rid of the warnings.
>>
>> Note that in the bleeding branch, compiler warnings are always shown. In
>> unstable, you need to -DREPORT_OPENCL_WARNINGS or -DDEBUG for them to show
>> up (as long as there are only warnings).
>>
>> magnum
>>
>>
>>
>> On 28 Jan, 2013, at 2:09 , Brian Wallace <nightstrike9809@...il.com>
>> wrote:
>>
>> When I applied the opencl optimization, I only saw minor improvements
>> compared to the CUDA improvements.  I found that was kind of weird, because
>> it was basically the same changes to the code.
>>
>> On Sun, Jan 27, 2013 at 7:58 PM, magnum <john.magnum@...hmail.com> wrote:
>>
>>> On 28 Jan, 2013, at 1:41 , Solar Designer <solar@...nwall.com> wrote:
>>> > On Sun, Jan 27, 2013 at 07:22:19PM -0500, Brian Wallace wrote:
>>> >> Ok, I'll do those changes.  I haven't done much cuda/ocl coding in the
>>> >> past, so it might take me a short while to get up to speed on what works
>>> >> best, although I have a good background in C and hash cracking
>>> >> optimization.  What kind of benchmarks are we getting on pwsafe-opencl vs
>>> >> hashcat?
>>> >
>>> > Apparently, hashcat's speed is ~500k on HD 7970.  hashkill is at ~480k:
>>> >
>>> > http://twitter.com/gat3way/status/294968226209726464/photo/1
>>> >
>>> > We're getting 355k:
>>> >
>>>
>>> > (The match of OpenCL and CUDA speed is curious.  I did not tune THREADS
>>> > and BLOCKS in cuda_pwsafe.h, and was compiling for the default of sm_10.
>>> > Perhaps better speed is possible with some tuning.)
>>>
>>> The OpenCL format currently only auto-tunes local work-size (THREADS) so
>>> it too runs under suboptimal conditions. The global work-size defaults to
>>> the same figure the CUDA format uses. It does support LWS/GWS environment
>>> variables though:
>>>
>>> $ GWS=$((256*1024)) ../run/john -t -fo:pwsafe-opencl -plat=1
>>> OpenCL platform 1: AMD Accelerated Parallel Processing, 2 device(s).
>>> Device 0: Tahiti (AMD Radeon HD 7900 Series)
>>> Local worksize (LWS) 64, Global worksize (GWS) 262144
>>> Benchmarking: Password Safe SHA-256 [OpenCL]... DONE
>>> Raw:    362411 c/s real, 78643K c/s virtual
>>>
>>> No huge difference though.
>>>
>>> In bleeding, Claudio has added a shared function for tuning GWS. I
>>> haven't had time to try it out yet.
>>>
>>> magnum
>>>
>>
>>
>>
>

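For reference, the "signed shift result" warning quoted above appears to come
from sigma0( 256 ) passing a plain int literal into ror, so 256 << 25 is
evaluated as a signed 32-bit shift.  A minimal sketch of one way to silence
it in plain C, assuming 32-bit words: the names ROR32 and SIGMA0 are
hypothetical, not the identifiers used in the pwsafe kernels, and the casts
are my assumption about the kind of "type conversions" meant above.

```c
#include <stdint.h>

/* Hypothetical replacement for the quoted ror macro.
 * Casting to uint32_t makes both shifts unsigned, so a signed int literal
 * such as 256 no longer overflows 'int' when shifted left.
 * The arguments are parenthesized so expressions can be passed safely.
 * n must be in the range 1..31; n == 0 would shift by 32, which is undefined. */
#define ROR32(x, n) \
    ((uint32_t)(((uint32_t)(x) >> (n)) | ((uint32_t)(x) << (32 - (n)))))

/* SHA-256 small sigma0, same shape as the quoted sigma0 macro. */
#define SIGMA0(x) \
    (ROR32((x), 7) ^ ROR32((x), 18) ^ ((uint32_t)(x) >> 3))

/* Function form, convenient for testing: returns sigma0 of one 32-bit word. */
static inline uint32_t sigma0_u32(uint32_t x)
{
    return SIGMA0(x);
}
```

With these definitions SIGMA0(256) evaluates to 0x00400022.  Writing the
constant as 256U at the call site should have the same effect without
touching the macro.  In OpenCL C, the built-in rotate() rotates left, so a
right rotate by n on a uint would be rotate(x, 32U - n); that is likely why
the switch needs the operands to be unsigned.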