Date: Sat, 4 Jul 2015 17:08:29 +0200
From: Agnieszka Bielec <bielecagnieszka8@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: PHC: Lyra2 on GPU

2015-07-04 11:54 GMT+02:00 Solar Designer <solar@...nwall.com>:
> On Sat, Jul 04, 2015 at 02:04:26AM +0200, Agnieszka Bielec wrote:
>> I received results:
>>
>> [a@...er run]$ ./john --test --format=lyra2-opencl --dev=5
>> Benchmarking: Lyra2-opencl, Lyra2 [Lyra2 Sponge OpenCL (inefficient,
>> development use only)]... Device 5: GeForce GTX TITAN
>> Local worksize (LWS) 64, global worksize (GWS) 2048
>> DONE
>> Speed for cost 1 (t) of 8, cost 2 (m) of 8, cost 3 (c) of 256, cost 4 (p) of 2
>> Raw:    6023 c/s real, 5965 c/s virtual
>>
>> [a@...er run]$ ./john --test --format=lyra2-opencl
>> Benchmarking: Lyra2-opencl, Lyra2 [Lyra2 Sponge OpenCL (inefficient,
>> development use only)]... Device 0: Tahiti [AMD Radeon HD 7900 Series]
>> Local worksize (LWS) 64, global worksize (GWS) 2048
>> DONE
>> Speed for cost 1 (t) of 8, cost 2 (m) of 8, cost 3 (c) of 256, cost 4 (p) of 2
>> Raw:    7447 c/s real, 51200 c/s virtual
>>
>> before the optimizations, the speed was 1k c/s
>
> Cool.  And these are much better than what you were getting with Lyra2
> authors' CUDA code, right?

yes, but they said that their implementation isn't optimal

this is the best result I got:

[a@...er run]$ ./john --test --format=lyra2-cuda
Benchmarking: Lyra2-cuda, Lyra2 [Lyra2 CUDA]... DONE
Speed for cost 1 (t) of 8, cost 2 (m) of 8
Raw:    1914 c/s real, 1932 c/s virtual

my first OpenCL version ran at more than 1k c/s, but I don't remember
the exact figure

>
> Are these higher speeds reproducible on actual cracking runs?  Please test.

what does 'reproducible on actual cracking runs' mean?

>
>> my optimizations are based on transferring one table to local memory
>> and copying small portions of global memory into local buffers. I
>> didn't see any sense in coalescing, so I didn't try it
>
> OK.
>
> Is the "copying small portions of global memory into local buffers" like
> prefetching?  Or are those small portions more frequently accessed than
> the rest?  In other words, why is this optimization effective for Lyra2?

I'm copying data in several separate for loops. Only sometimes is one
element accessed twice; mostly it's accessed once. But these portions
are small anyway: 12 ulongs for one random pointer to global memory
(4 is the max). So I decided to copy even if something is accessed
only once. I also tried copying bigger portions at once, but the speed
was worse. Even if something is accessed only once, it's faster with
copying on the AMD GPU.
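
To make the access pattern concrete, here is a minimal sketch in plain C,
not the actual OpenCL kernel. It assumes that "4 is max" means at most four
random pointers per step. The names mix_staged and mix_direct and the XOR
mixing step are hypothetical stand-ins, not the Lyra2 sponge. On a CPU the
"local buffer" is just a stack array, so this shows only the
copy-then-compute structure, not the speed effect seen on the GPU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CHUNK_WORDS 12 /* 12 ulongs per random pointer, as in the message */
#define MAX_PTRS 4     /* assumption: at most 4 random pointers per step */

/* Staged variant: copy each small chunk into a local buffer first,
 * then compute only from the local copies.
 * The XOR fold is a hypothetical placeholder for the real mixing. */
void mix_staged(uint64_t state[CHUNK_WORDS],
                const uint64_t *matrix, size_t matrix_words,
                const size_t *offsets, size_t n_ptrs)
{
    uint64_t staged[MAX_PTRS][CHUNK_WORDS];

    assert(n_ptrs <= MAX_PTRS);

    /* Phase 1: one copy loop per chunk, stand-in for global -> local. */
    for (size_t p = 0; p < n_ptrs; p++) {
        assert(offsets[p] + CHUNK_WORDS <= matrix_words);
        memcpy(staged[p], matrix + offsets[p], sizeof staged[p]);
    }

    /* Phase 2: the computation touches only the staged copies. */
    for (size_t p = 0; p < n_ptrs; p++)
        for (size_t i = 0; i < CHUNK_WORDS; i++)
            state[i] ^= staged[p][i];
}

/* Direct variant for comparison: reads the big array in place. */
void mix_direct(uint64_t state[CHUNK_WORDS],
                const uint64_t *matrix, size_t matrix_words,
                const size_t *offsets, size_t n_ptrs)
{
    for (size_t p = 0; p < n_ptrs; p++) {
        assert(offsets[p] + CHUNK_WORDS <= matrix_words);
        for (size_t i = 0; i < CHUNK_WORDS; i++)
            state[i] ^= matrix[offsets[p] + i];
    }
}
```

Both variants compute the same result; the only difference is where the
reads of the big array happen. In an OpenCL kernel the staged array would
presumably sit in __local or private memory.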
