Date: Sun, 9 Jun 2013 00:07:04 +0200
From: Lukas Odzioba <lukas.odzioba@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: sha3-opencl

2013/6/6 Dániel Bali <balijanosdaniel@...il.com>:
> daniel@...l:~/bleeding-jumbo/JohnTheRipper/src$ ../run/john -test
> --format=raw-keccak256-opencl
> Device 0: GeForce GTX 570
> Local worksize (LWS) 128, global worksize (GWS) 524288
> Benchmarking: raw-keccak256-opencl, Raw Keccak256 [OpenCL (inefficient,
> development use only)]... DONE
> Raw:    27525K c/s real, 27525K c/s virtual

Daniel, please share the current speeds (we added longer test vectors)
on the 7970, the 570 and bull's CPU (using the OpenCL implementation).

For comparison, this is the result on an AMD FX(tm)-8120 using AVX:
Benchmarking: raw-keccak-256, Keccak 256 [AVX]... DONE
Raw:    2056K c/s real, 2076K c/s virtual

Because this code was not easy to port to the GPU in the limited time
we had, we decided to switch the base implementation to Matt Mahoney's
code, available here:
http://encode.ru/threads/1613-SHA3-winner-announced/page2

As far as I remember, we were getting ~1200K c/s on bull's CPU using his code.

Some more comments on the code:
1) This loop writes every index four times, because i/4 takes only 8
distinct values over 32 iterations:
        // (Hack) clear output buffers first
        for (i = 0; i < 32; ++i) {
                hashes[(i/4) * num_keys + gid] = 0;
        }
Let's assume:
        int num_keys = 1024;
        int gid = 5;
        for (i = 0; i < 32; ++i) {
                printf("%d ", (i/4) * num_keys + gid);
        }
Then we get: 5 5 5 5 1029 1029 1029 1029 2053 2053 2053 2053
3077 3077 3077 3077 4101 4101 4101 4101 5125 5125 5125 5125 6149 6149
6149 6149 7173 7173 7173 7173

2) We can add #pragma unroll N, where N is a constant, to loops with a
fixed trip count, for example:
// keccak::get()
#pragma unroll 32
for (i = 0; i < 32; ++i) {
This gives no c/s improvement right now, but it is cleaner to add it
now so that we do not have to think about it later.

3) I am curious what ISA code is generated for these macros:

#define GETCHAR(buf, index) ((uchar)(buf >> index * 8) & 0xff)
#define PUTCHAR(buf, index, val) (buf |= (val << index * 8))

Usually it is better to make reads/writes to global memory 32 bits wide.
If you have some free time tomorrow, you can check this out.
Next week we will move on to more serious tasks.
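
As a rough illustration of the 32-bit-wide idea, here is a plain C sketch, not the kernel itself. store_bytewise mimics what PUTCHAR does when called four times on the same word, and store_wordwise assembles the word in a local variable and stores it once. Both function names are hypothetical, and whether a given OpenCL compiler already merges the byte accesses is exactly what the ISA dump would tell us.

```c
#include <stddef.h>
#include <stdint.h>

/* Byte at a time: four read-modify-write operations on the same word.
   Because it ORs into *dst, the destination must be zeroed beforehand. */
void store_bytewise(uint32_t *dst, const uint8_t src[4])
{
    for (unsigned i = 0; i < 4; ++i)
        *dst |= (uint32_t)src[i] << (i * 8);
}

/* Word at a time: build the value in a local (a register on the GPU),
   then issue a single 32-bit store. No prior clearing is needed. */
void store_wordwise(uint32_t *dst, const uint8_t src[4])
{
    uint32_t w = 0;
    for (unsigned i = 0; i < 4; ++i)
        w |= (uint32_t)src[i] << (i * 8);
    *dst = w;
}
```

A side effect of the word-wise variant is that the separate buffer-clearing loop from 1) would no longer be needed, since the store overwrites the word instead of ORing into it.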

4) You can try running the OpenCL profiler and share some of the
results of the analysis here. I have some intuition about what it will
show, but I will leave it as an exercise for you.

How did you like this task?
Lukas
