john-dev - Re: PHC: Lyra2 on GPU

Message-ID: <CAKGDhHUn4tOT_A_yPLnyVyDYBC0kNuhAL1tw9z1w4856CciLJA@mail.gmail.com>
Date: Sat, 4 Jul 2015 17:08:29 +0200
From: Agnieszka Bielec <bielecagnieszka8@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: PHC: Lyra2 on GPU

2015-07-04 11:54 GMT+02:00 Solar Designer <solar@...nwall.com>:
> On Sat, Jul 04, 2015 at 02:04:26AM +0200, Agnieszka Bielec wrote:
>> I received results:
>>
>> [a@...er run]$ ./john --test --format=lyra2-opencl --dev=5
>> Benchmarking: Lyra2-opencl, Lyra2 [Lyra2 Sponge OpenCL (inefficient,
>> development use only)]... Device 5: GeForce GTX TITAN
>> Local worksize (LWS) 64, global worksize (GWS) 2048
>> DONE
>> Speed for cost 1 (t) of 8, cost 2 (m) of 8, cost 3 (c) of 256, cost 4 (p) of 2
>> Raw:    6023 c/s real, 5965 c/s virtual
>>
>> [a@...er run]$ ./john --test --format=lyra2-opencl
>> Benchmarking: Lyra2-opencl, Lyra2 [Lyra2 Sponge OpenCL (inefficient,
>> development use only)]... Device 0: Tahiti [AMD Radeon HD 7900 Series]
>> Local worksize (LWS) 64, global worksize (GWS) 2048
>> DONE
>> Speed for cost 1 (t) of 8, cost 2 (m) of 8, cost 3 (c) of 256, cost 4 (p) of 2
>> Raw:    7447 c/s real, 51200 c/s virtual
>>
>> before optimizations speed was equal to 1k
>
> Cool.  And these are much better than what you were getting with Lyra2
> authors' CUDA code, right?

Yes, but they said that their implementation isn't optimal.

This is the best result I got:

[a@...er run]$ ./john --test --format=lyra2-cuda
Benchmarking: Lyra2-cuda, Lyra2 [Lyra2 CUDA]... DONE
Speed for cost 1 (t) of 8, cost 2 (m) of 8
Raw:    1914 c/s real, 1932 c/s virtual

My first OpenCL version ran at more than 1k c/s, but I don't remember the
exact figure.

>
> Are these higher speeds reproducible on actual cracking runs?  Please test.

What does 'reproducible on actual cracking runs' mean?

>
>> my optimizations are based on transferring one table to local memory and
>> copying small portions of global memory into local buffers. I didn't
>> see any sense in coalescing and I didn't try it
>
> OK.
>
> Is the "copying small portions of global memory into local buffers" like
> prefetching?  Or are those small portions more frequently accessed than
> the rest?  In other words, why is this optimization effective for Lyra2?

I'm copying data in several separate for loops. Only sometimes is an
element accessed twice; mostly it's accessed once. But these portions
are small anyway: 12 ulongs for each random pointer into global memory
(4 is the max), so I decided to copy even if something is accessed only
once. I also tried copying bigger portions at once, but the speed was
worse. Even if something is accessed only once, it's faster with copying
on the AMD GPU.
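
A rough sketch of the staging pattern described above, written as plain C
so it can be compiled and run on a CPU. It is not the actual kernel: the
names mix_word, mix_direct and mix_staged are hypothetical, the mixing
step is a placeholder rather than the Lyra2 sponge, and "4 is max" is read
here as "at most 4 pointers", which is an assumption. In an OpenCL kernel
the matrix pointer would be __global and the staged buffer would live in
__local or private memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_WORDS 12 /* 12 ulongs per random pointer, as in the message */
#define MAX_PTRS    4  /* assumed meaning of "4 is max": at most 4 pointers */

/* Placeholder mixing step (hypothetical); NOT the Lyra2 sponge. */
static uint64_t mix_word(uint64_t acc, uint64_t w, size_t i)
{
    return (acc ^ (w + i)) * 0x9E3779B97F4A7C15ULL;
}

/* Baseline: read every word straight from the large ("global") matrix. */
static uint64_t mix_direct(const uint64_t *matrix, const size_t *offsets,
                           size_t n_ptrs)
{
    uint64_t acc = 0;
    for (size_t p = 0; p < n_ptrs; p++)
        for (size_t i = 0; i < BLOCK_WORDS; i++)
            acc = mix_word(acc, matrix[offsets[p] + i], i);
    return acc;
}

/* Staged: first copy each small block into a local buffer in its own
 * loop, then compute only from the staged copy. */
static uint64_t mix_staged(const uint64_t *matrix, const size_t *offsets,
                           size_t n_ptrs)
{
    uint64_t staged[MAX_PTRS][BLOCK_WORDS];
    uint64_t acc = 0;

    assert(n_ptrs <= MAX_PTRS);

    /* Loop 1: small copies, one 12-word block per random pointer. */
    for (size_t p = 0; p < n_ptrs; p++)
        memcpy(staged[p], matrix + offsets[p], sizeof staged[p]);

    /* Loop 2: all arithmetic touches only the staged buffer. */
    for (size_t p = 0; p < n_ptrs; p++)
        for (size_t i = 0; i < BLOCK_WORDS; i++)
            acc = mix_word(acc, staged[p][i], i);
    return acc;
}
```

Both functions return the same value for the same inputs; the only
difference is where the reads come from. Whether the staged form is
faster depends on the device, which matches the observation above that it
helped on the AMD GPU even for data accessed only once.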
