Date: Thu, 04 Jun 2015 13:38:36 +0200
From: magnum <john.magnum@...hmail.com>
To: john-dev@...ts.openwall.com
Subject: Re: PHC: Parallel in OpenCL

On 2015-06-04 12:07, magnum wrote:
> On 2015-06-04 00:45, Lukas Odzioba wrote:
>> Agnieszka tried to implement an optimization that exploits the presence
>> of 0 bytes in the sha512 input, which happens in the "parallel loop".
>> We can't make that assumption for all sha512 calls used in function
>> parallel, so implementing a slightly different SHA512 with this
>> optimization (while still keeping the normal version) increased code
>> size, which we think reduced performance because the code size exceeded
>> the L1 code cache on GCN; actual performance after this change dropped
>> from 45k to 28k c/s.
>> She also implemented a split kernel, which by itself also degraded
>> performance (from 28k to 27k c/s).
>
> Isn't the loop kernel using the "zeros" sha function (only)? And the
> other kernels use the full version? Then I can't see how code size would
> be larger for a given kernel.

I had a quick look at the code.

Agnieszka, you again based your code on a fast hash's code. While this 
doesn't hurt performance, it does no good at all either; it just makes 
everything more complicated and the code harder to read.

In order for the kernel to build on OSX, I had to apply the following 
patch (driver bug workaround, should not affect resulting code):

-       unsigned long state[8] = {
-               0x6a09e667f3bcc908, 0xbb67ae8584caa73b, 0x3c6ef372fe94f82b, 0xa54ff53a5f1d36f1,
-               0x510e527fade682d1, 0x9b05688c2b3e6c1f, 0x1f83d9abfb41bd6b, 0x5be0cd19137e2179};
+       unsigned long state[8];
        unsigned int left = length;

+       state[0] = 0x6a09e667f3bcc908UL;
+       state[1] = 0xbb67ae8584caa73bUL;
+       state[2] = 0x3c6ef372fe94f82bUL;
+       state[3] = 0xa54ff53a5f1d36f1UL;
+       state[4] = 0x510e527fade682d1UL;
+       state[5] = 0x9b05688c2b3e6c1fUL;
+       state[6] = 0x1f83d9abfb41bd6bUL;
+       state[7] = 0x5be0cd19137e2179UL;
+

My biggest concern is that at the end of the day you only call the loop 
kernel once (for cost 0). Normally you'd call such a kernel 10 or many 
more times, with less work being performed per call. This means the 
split only adds overhead, and that is why you see a performance 
regression. It looks to me like each call is 3*5*128 rounds of SHA512? 
Maybe each loop kernel invocation should just be 128 rounds, and you'd 
call it 3*5 times.
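
For illustration, here's a rough host-side sketch of what I mean. This 
is not code from our tree; queue, loop_kernel, gws and lws are just 
placeholders for whatever you already set up, only the OpenCL calls 
themselves are real:

        #include <CL/cl.h>

        /* Hypothetical sketch: enqueue the loop kernel 3*5 times, each
           call doing 128 rounds, instead of one big enqueue doing all
           3*5*128 rounds at once. */
        static void run_loop_kernel(cl_command_queue queue,
                                    cl_kernel loop_kernel,
                                    size_t gws, size_t lws)
        {
                int i;

                for (i = 0; i < 3 * 5; i++) {
                        clEnqueueNDRangeKernel(queue, loop_kernel, 1, NULL,
                                               &gws, &lws, 0, NULL, NULL);
                        /* Drain the queue between calls; each enqueue now
                           does less work, which is the whole point of
                           splitting. */
                        clFinish(queue);
                }
        }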

Also, your use of the shared auto-tune function is totally busted. You'd 
be much better off not using auto-tune at all than configuring it wrong. 
Attached is a patch that mostly fixes it.

Note these lines (after my patch):

        opencl_init_auto_setup(SEED, 3*5*128*1, split_events,
             warn, 4, self, create_clobj, release_clobj, BINARY_SIZE*3, 0);

        autotune_run(self, 3*5*128*1, 0, 1000);

If you change the loop kernel to do only 128 rounds per call, you should 
change that figure accordingly for opencl_init_auto_setup() but not for 
autotune_run(). The latter takes the total; the former takes how much 
you do per call. If you change to a test vector with another cost, 
change the *1 accordingly for both.
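
For example, if the loop kernel were changed to 128 rounds per call (and 
the cost-0 test vector kept), I'd expect those two lines to end up 
something like this:

        opencl_init_auto_setup(SEED, 128*1, split_events,
             warn, 4, self, create_clobj, release_clobj, BINARY_SIZE*3, 0);

        autotune_run(self, 3*5*128*1, 0, 1000);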

magnum

View attachment "parallel.diff" of type "text/plain" (6205 bytes)
