john-dev - Re: 64-bit rotate on AMD GCN

Message-ID: <20151010053543.GA4887@openwall.com>
Date: Sat, 10 Oct 2015 08:35:43 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Cc: Myrice <qqlddg@...il.com>
Subject: Re: 64-bit rotate on AMD GCN

On Sat, Oct 10, 2015 at 07:52:06AM +0300, Solar Designer wrote:
> #define ror(x, n)       ((n) < 32 ? (amd_bitalign((uint)((x) >> 32), (uint)(x), (uint)(n)) | ((ulong)amd_bitalign((uint)(x), (uint)((x) >> 32), (uint)(n)) << 32)) : (amd_bitalign((uint)(x), (uint)((x) >> 32), (uint)(n) - 32) | ((ulong)amd_bitalign((uint)((x) >> 32), (uint)(x), (uint)(n) - 32) << 32)))

I've just tried introducing the above revision of ror() into myrice's
xsha512_kernel.cl, which previously used rotate(), and speed went from:

[solar@...er run]$ AMD_OCL_BUILD_OPTIONS_APPEND=-save-temps ./john -test=10 -form=xsha512-opencl -dev=2 -v=4
[...]
Local worksize (LWS) 128, global worksize (GWS) 8388608
DONE
Many salts:     278223K c/s real, 4973M c/s virtual
Only one salt:  56310K c/s real, 72389K c/s virtual

to:

[solar@...er run]$ AMD_OCL_BUILD_OPTIONS_APPEND=-save-temps ./john -test=10 -form=xsha512-opencl -dev=2 -v=4
[...]
Local worksize (LWS) 128, global worksize (GWS) 8388608
DONE
Many salts:     345265K c/s real, 5082M c/s virtual
Only one salt:  58486K c/s real, 72315K c/s virtual

So we should expect 300M+ c/s for raw-sha512 as well (a similar rate is
also seen as e.g. "310395000 rounds/s" during auto-tuning for sha512crypt),
once we have on-GPU mask and hash comparisons implemented efficiently for
this hash type (I think we don't yet?)
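
For anyone who wants to sanity-check the ror() macro quoted at the top
without a GPU, here is a minimal host-side sketch in plain C. It assumes
amd_bitalign(a, b, c) returns the low 32 bits of the 64-bit value a:b
shifted right by (c & 31), which is how I read the cl_amd_media_ops
extension text. The names bitalign(), ror_bitalign(), ror_ref() and
ror_mismatches() are made up for this example and are not part of the
kernel.

```c
#include <stdint.h>

/* Assumed host-side stand-in for amd_bitalign():
 * low 32 bits of ((src0:src1) >> (src2 & 31)). */
static uint32_t bitalign(uint32_t src0, uint32_t src1, uint32_t src2)
{
    return (uint32_t)((((uint64_t)src0 << 32) | src1) >> (src2 & 31));
}

/* Same structure as the ror() macro: two 32-bit funnel shifts,
 * with the halves swapped when the rotate count is 32 or more. */
static uint64_t ror_bitalign(uint64_t x, uint32_t n)
{
    uint32_t hi = (uint32_t)(x >> 32);
    uint32_t lo = (uint32_t)x;

    if (n < 32)
        return bitalign(hi, lo, n) |
            ((uint64_t)bitalign(lo, hi, n) << 32);
    return bitalign(lo, hi, n - 32) |
        ((uint64_t)bitalign(hi, lo, n - 32) << 32);
}

/* Plain 64-bit rotate right used as the reference. */
static uint64_t ror_ref(uint64_t x, uint32_t n)
{
    n &= 63;
    return n ? (x >> n) | (x << (64 - n)) : x;
}

/* Compare both versions for every rotate count 0..63 on a few
 * test values; returns the number of mismatches found. */
static int ror_mismatches(void)
{
    static const uint64_t samples[] = {
        0x0123456789abcdefULL, 0x8000000000000001ULL,
        0xffffffff00000000ULL, 0x0000000000000001ULL
    };
    int bad = 0;

    for (unsigned i = 0; i < sizeof(samples) / sizeof(samples[0]); i++)
        for (uint32_t n = 0; n < 64; n++)
            if (ror_bitalign(samples[i], n) != ror_ref(samples[i], n))
                bad++;
    return bad;
}
```

Calling ror_mismatches() from a small main() and checking for zero covers
all counts, including the ones SHA-512 uses (1, 8, 14, 18, 19, 28, 34, 39,
41, 61). Since those counts are compile-time constants in the kernel, the
(n) < 32 test in the macro should fold away at build time.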

Similarly to sha512crypt, IL size went way up, and ISA size slightly down:

[solar@...er run]$ ls -l a b
a:
total 840
-rw-------. 1 solar solar   5733 Oct 10 08:13 _temp_0_Tahiti.cl
-rw-------. 1 solar solar   6838 Oct 10 08:13 _temp_0_Tahiti.i
-rw-------. 1 solar solar 163501 Oct 10 08:13 _temp_0_Tahiti.il
-rw-------. 1 solar solar   3445 Oct 10 08:13 _temp_0_Tahiti_kernel_cmp.il
-rw-------. 1 solar solar   5443 Oct 10 08:13 _temp_0_Tahiti_kernel_cmp.isa
-rw-------. 1 solar solar 159665 Oct 10 08:13 _temp_0_Tahiti_kernel_xsha512.il
-rw-------. 1 solar solar 506243 Oct 10 08:13 _temp_0_Tahiti_kernel_xsha512.isa

b:
total 984
-rw-------. 1 solar solar   6038 Oct 10 08:17 _temp_0_Tahiti.cl
-rw-------. 1 solar solar  11310 Oct 10 08:17 _temp_0_Tahiti.i
-rw-------. 1 solar solar 261999 Oct 10 08:17 _temp_0_Tahiti.il
-rw-------. 1 solar solar   3445 Oct 10 08:17 _temp_0_Tahiti_kernel_cmp.il
-rw-------. 1 solar solar   5443 Oct 10 08:17 _temp_0_Tahiti_kernel_cmp.isa
-rw-------. 1 solar solar 258163 Oct 10 08:17 _temp_0_Tahiti_kernel_xsha512.il
-rw-------. 1 solar solar 450253 Oct 10 08:17 _temp_0_Tahiti_kernel_xsha512.isa

[solar@...er run]$ fgrep codeLenInByte [ab]/*.isa
a/_temp_0_Tahiti_kernel_cmp.isa:codeLenInByte        = 172 bytes;
a/_temp_0_Tahiti_kernel_xsha512.isa:codeLenInByte        = 30432 bytes;
b/_temp_0_Tahiti_kernel_cmp.isa:codeLenInByte        = 172 bytes;
b/_temp_0_Tahiti_kernel_xsha512.isa:codeLenInByte        = 30140 bytes;
[solar@...er run]$ fgrep NumVgpr [ab]/*.isa
a/_temp_0_Tahiti_kernel_cmp.isa:NumVgprs             = 3;
a/_temp_0_Tahiti_kernel_xsha512.isa:NumVgprs             = 98;
b/_temp_0_Tahiti_kernel_cmp.isa:NumVgprs             = 3;
b/_temp_0_Tahiti_kernel_xsha512.isa:NumVgprs             = 106;

On a related note, sha512crypt-opencl is now almost the same speed as
sha256crypt-opencl on GCN, meaning that there must be lots of room for
improvement in the latter.  sha512crypt-opencl:

gws:    262144       62827   314135000 rounds/s    4.172s per crypt_all()+
Local worksize (LWS) 256, global worksize (GWS) 262144
DONE
Speed for cost 1 (iteration count) of 5000
Raw:    55072 c/s real, 2383K c/s virtual

sha256crypt-opencl:

gws:   1048576       64687   323435000 rounds/s   16.209s per crypt_all()+
Local worksize (LWS) 64, global worksize (GWS) 1048576
DONE
Speed for cost 1 (iteration count) of 5000
Raw:    68089 c/s real, 5518K c/s virtual

Curiously, auto-tuning shows little speed difference between these two
(e.g. 62827 vs. 64687 on the final and best lines here), but the final
benchmark results differ more.  Also, the optimal GWS of 1048576 is very
high for a slow hash.
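
To put numbers on that, here is a small C sketch using only the figures
from the output above. It relies on the observation that the "rounds/s"
value on each auto-tuning line equals the preceding c/s figure times the
iteration count of 5000. The helper names rounds_per_sec() and ratio() are
made up for this example.

```c
#include <stdint.h>

/* rounds/s as printed during auto-tuning: c/s times iteration count. */
static uint64_t rounds_per_sec(uint64_t cps, uint64_t iterations)
{
    return cps * iterations;
}

/* Speed of one run relative to another. */
static double ratio(double faster, double slower)
{
    return faster / slower;
}

/* Figures taken from the benchmark output in this message. */
static const uint64_t tune_sha512crypt = 62827;  /* auto-tuning c/s */
static const uint64_t tune_sha256crypt = 64687;
static const uint64_t final_sha512crypt = 55072; /* "Raw:" c/s real */
static const uint64_t final_sha256crypt = 68089;
```

With these, ratio(tune_sha256crypt, tune_sha512crypt) comes out at roughly
1.03, while ratio(final_sha256crypt, final_sha512crypt) is roughly 1.24, so
the gap grows from about 3% during auto-tuning to about 24% in the final
benchmark.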

Alexander