john-dev - 64-bit rotate on AMD GCN

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <20151010045206.GA4549@openwall.com>
Date: Sat, 10 Oct 2015 07:52:06 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: 64-bit rotate on AMD GCN

Claudio, magnum, Agnieszka -

I've just tried out Claudio's latest sha512crypt-opencl commits - good
improvement on GCN, thanks!  Most of the time, I got these -test speeds
on super's -dev=2 (Tahiti 1050 MHz, Catalyst 15.7):

Speed for cost 1 (iteration count) of 5000
Raw:    39125 c/s real, 1456K c/s virtual

Speed for cost 1 (iteration count) of 5000
Raw:    38325 c/s real, 728177 c/s virtual

On some rare occasions, I got:

Speed for cost 1 (iteration count) of 5000
Raw:    55538 c/s real, 1456K c/s virtual

even though the speeds reported by -v=4 during auto-tuning never went
above 41k.  It's weird.

Then I noticed that opencl_sha512.h uses:

#define ror(x, n)       ((x >> n) | (x << (64UL-n)))

In the generated ISA code, there are no v_alignbit_b32 instructions.
So I tried to use rotate():

#define ror(x, n)       rotate(x, 64UL-n)

and speeds went up (except for the very rare 55k seen above), e.g. on
four consecutive tests:

Speed for cost 1 (iteration count) of 5000
Raw:    44734 c/s real, 1310K c/s virtual

Speed for cost 1 (iteration count) of 5000
Raw:    48188 c/s real, 728177 c/s virtual

Speed for cost 1 (iteration count) of 5000
Raw:    43690 c/s real, 728177 c/s virtual

Speed for cost 1 (iteration count) of 5000
Raw:    44887 c/s real, 2383K c/s virtual

The ISA changed, but still did not use v_alignbit_b32 instructions.
Notably, while all IL sizes and the ISA sizes for kernel_crypt and
kernel_final remained unchanged (although the instructions and VGPR
usage changed), the ISA size for kernel_prepare reduced:

-codeLenInByte        = 163080 bytes;
+codeLenInByte        = 162504 bytes;

(It's huge either way, though.)

Next I tried to use amd_bitalign() explicitly, initially for a half of
the rotates:

#define ror(x, n)       ((n) < 32 ? (amd_bitalign((uint)((x) >> 32), (uint)(x), (uint)(n)) | ((ulong)amd_bitalign((uint)(x), (uint)((x) >> 32), (uint)(n)) << 32)) : rotate((x), 64UL - (n)))

IL sizes went way up, ISA sizes significantly down (for all 3 kernels),
and speed went up:

Speed for cost 1 (iteration count) of 5000
Raw:    60124 c/s real, 2621K c/s virtual

Speed for cost 1 (iteration count) of 5000
Raw:    54841 c/s real, 2621K c/s virtual

Of course, there were now v_alignbit_b32 instructions in the generated
ISA code.  I ran only these two benchmarks with this code revision
before proceeding further, to:

#define ror(x, n)       ((n) < 32 ? (amd_bitalign((uint)((x) >> 32), (uint)(x), (uint)(n)) | ((ulong)amd_bitalign((uint)(x), (uint)((x) >> 32), (uint)(n)) << 32)) : (amd_bitalign((uint)(x), (uint)((x) >> 32), (uint)(n) - 32) | ((ulong)amd_bitalign((uint)((x) >> 32), (uint)(x), (uint)(n) - 32) << 32)))

IL went further up, ISA further down, speeds went in unclear direction:

Speed for cost 1 (iteration count) of 5000
Raw:    54386 c/s real, 2912K c/s virtual

Speed for cost 1 (iteration count) of 5000
Raw:    54841 c/s real, 1456K c/s virtual

Speed for cost 1 (iteration count) of 5000
Raw:    54161 c/s real, 819200 c/s virtual

Speed for cost 1 (iteration count) of 5000
Raw:    53718 c/s real, 728177 c/s virtual

even though the speeds reported during auto-tuning went up (now above
62k is reported during auto-tuning, only to be followed with the 54k'ish
speeds on the final benchmark).  For example:

[solar@...er run]$ AMD_OCL_BUILD_OPTIONS_APPEND=-save-temps ./john -test -form=sha512crypt-opencl -dev=2 -v=4
Benchmarking: sha512crypt-opencl, crypt(3) $6$ (rounds=5000) [SHA512 OpenCL]...
Device 2: Tahiti [AMD Radeon HD 7900 Series]
Calculating best GWS for LWS=64; max. 150ms single kernel invocation.
gws:      2048        5158    25790000 rounds/s 397.002ms per crypt_all()!
gws:      4096       14338    71690000 rounds/s 285.657ms per crypt_all()!
gws:      8192       31089   155445000 rounds/s 263.499ms per crypt_all()!
gws:     16384       59464   297320000 rounds/s 275.524ms per crypt_all()+
gws:     32768       59717   298585000 rounds/s 548.714ms per crypt_all()
gws:     65536       61746   308730000 rounds/s    1.061s per crypt_all()+
gws:    131072       62269   311345000 rounds/s    2.104s per crypt_all()
Calculating best LWS for GWS=65536
Testing LWS=64 GWS=65536 ... 43.049ms+
Testing LWS=128 GWS=65536 ... 45.408ms
Testing LWS=192 GWS=65472 ... 61.590ms
Testing LWS=256 GWS=65536 ... 42.687ms+
Calculating best GWS for LWS=256; max. 300ms single kernel invocation.
gws:      8192       31530   157650000 rounds/s 259.809ms per crypt_all()!
gws:     16384       58888   294440000 rounds/s 278.222ms per crypt_all()+
gws:     32768       60613   303065000 rounds/s 540.605ms per crypt_all()+
gws:     65536       61189   305945000 rounds/s    1.071s per crypt_all()
gws:    131072       61748   308740000 rounds/s    2.122s per crypt_all()+
gws:    262144       63347   316735000 rounds/s    4.138s per crypt_all()+
Local worksize (LWS) 256, global worksize (GWS) 262144
DONE
Speed for cost 1 (iteration count) of 5000
Raw:    55072 c/s real, 2621K c/s virtual

Oh, 63k during the auto-tuning even.  Was that extrapolated from fewer
than 5000 iterations maybe?  Is the instruction cache hit rate worse for
5000 iterations maybe?  We could want to play with how we're splitting
the 5000 iterations across kernel invocations to hopefully regain this
speed for actual runs.  ... or maybe we have it for actual runs already:

[solar@...er run]$ ./john -form=sha512crypt-opencl -dev=2 -v=5 -inc=alpha -min-len=8 -max-len=8 pw
[...]
Local worksize (LWS) 64, global worksize (GWS) 262144
[...]
0g 0:00:00:00  0g/s 0p/s 0c/s 0C/s
0g 0:00:00:16  0g/s 48907p/s 48907c/s 48907C/s bigetort..soyfryap
0g 0:00:00:17  0g/s 46044p/s 46044c/s 46044C/s bigetort..soyfryap
0g 0:00:00:18  0g/s 58060p/s 58060c/s 58060C/s soyfryam..calefryn
0g 0:00:00:19  0g/s 54985p/s 54985c/s 54985C/s soyfryam..calefryn
0g 0:00:00:20  0g/s 52245p/s 52245c/s 52245C/s soyfryam..calefryn
0g 0:00:00:21  0g/s 49742p/s 49742c/s 49742C/s soyfryam..calefryn
0g 0:00:00:23  0g/s 56839p/s 56839c/s 56839C/s calefrya..astefeto
0g 0:00:00:33  0g/s 55505p/s 55505c/s 55505C/s chumaist..metalito
0g 0:00:00:34  0g/s 53859p/s 53859c/s 53859C/s chumaist..metalito
0g 0:00:00:38  0g/s 55101p/s 55101c/s 55101C/s metality..singuapp
0g 0:00:00:39  0g/s 60386p/s 60386c/s 60386C/s singuazy..abbortom
0g 0:00:00:40  0g/s 58923p/s 58923c/s 58923C/s singuazy..abbortom
0g 0:00:00:41  0g/s 57473p/s 57473c/s 57473C/s singuazy..abbortom
0g 0:00:00:42  0g/s 56093p/s 56093c/s 56093C/s singuazy..abbortom
0g 0:00:00:43  0g/s 60864p/s 60864c/s 60864C/s abbortot..mcmyleow
abcdefgh         (?)
1g 0:00:00:47 DONE (2015-10-10 07:24) 0.02114g/s 60976p/s 60976c/s 60976C/s abbortot..mcmyleow

61k average for a 47 seconds run, not bad.  Maybe this includes a not
fully processed last buffer (4 or 5 seconds), though?

As a final experiment, I tried omitting the "- 32" from my uses of
amd_bitalign() that had it, because amd_bitalign() is defined (unlike
e.g. bitwise shifts in C) to take its shift count "& 31":

https://www.khronos.org/registry/cl/extensions/amd/cl_amd_media_ops.txt

Specifically:

#define ror(x, n)       ((n) < 32 ? (amd_bitalign((uint)((x) >> 32), (uint)(x), (uint)(n)) | ((ulong)amd_bitalign((uint)(x), (uint)((x) >> 32), (uint)(n)) << 32)) : (amd_bitalign((uint)(x), (uint)((x) >> 32), (uint)(n)/* - 32*/) | ((ulong)amd_bitalign((uint)((x) >> 32), (uint)(x), (uint)(n)/* - 32*/) << 32)))

Notice two commented out "- 32"'s.  The resulting ISA size remained the
same, and speeds too, but the changed shift counts (now some above 31)
are actually encoded in the instructions.  It's curious that GCN even
has room in the instructions to encode those redundant shift counts,
instead of applying the "& 31" at compile time.

I think it's better to use the previous version, with the "- 32"'s
intact, as long as the rotate counts are compile-time constants, so this
subtraction is performed at compile time as well.

Further speedup might be possible through switching the SWAP64() macro
from using rotate() to using the approach above.  However, it is
currently used on both vector and scalar arguments, so my trivial
attempt at changing it like that failed.  I guess we'd need to split it
into two macros: one for vectors and one for scalars.  I am leaving this
for Claudio or/and magnum to experiment with.

Agnieszka - the 64-bit rotate optimizations discussed above are probably
also applicable to Argon2 and Lyra2.

Claudio - some additional info on the experiments above, in the same
order (first is your code, then my revisions as described above):

[solar@...er run]$ fgrep codeLenInByte ?/*.isa
1/_temp_0_Tahiti_kernel_crypt.isa:codeLenInByte        = 59044 bytes;
1/_temp_0_Tahiti_kernel_final.isa:codeLenInByte        = 60240 bytes;
1/_temp_0_Tahiti_kernel_prepare.isa:codeLenInByte        = 163080 bytes;
2/_temp_0_Tahiti_kernel_crypt.isa:codeLenInByte        = 59044 bytes;
2/_temp_0_Tahiti_kernel_final.isa:codeLenInByte        = 60240 bytes;
2/_temp_0_Tahiti_kernel_prepare.isa:codeLenInByte        = 162504 bytes;
3/_temp_0_Tahiti_kernel_crypt.isa:codeLenInByte        = 59164 bytes;
3/_temp_0_Tahiti_kernel_final.isa:codeLenInByte        = 60328 bytes;
3/_temp_0_Tahiti_kernel_prepare.isa:codeLenInByte        = 158704 bytes;
4/_temp_0_Tahiti_kernel_crypt.isa:codeLenInByte        = 59148 bytes;
4/_temp_0_Tahiti_kernel_final.isa:codeLenInByte        = 60316 bytes;
4/_temp_0_Tahiti_kernel_prepare.isa:codeLenInByte        = 158616 bytes;
5/_temp_0_Tahiti_kernel_crypt.isa:codeLenInByte        = 59148 bytes;
5/_temp_0_Tahiti_kernel_final.isa:codeLenInByte        = 60316 bytes;
5/_temp_0_Tahiti_kernel_prepare.isa:codeLenInByte        = 158616 bytes;
[solar@...er run]$ fgrep NumVgpr ?/*.isa
1/_temp_0_Tahiti_kernel_crypt.isa:NumVgprs             = 116;
1/_temp_0_Tahiti_kernel_final.isa:NumVgprs             = 116;
1/_temp_0_Tahiti_kernel_prepare.isa:NumVgprs             = 90;
2/_temp_0_Tahiti_kernel_crypt.isa:NumVgprs             = 115;
2/_temp_0_Tahiti_kernel_final.isa:NumVgprs             = 115;
2/_temp_0_Tahiti_kernel_prepare.isa:NumVgprs             = 91;
3/_temp_0_Tahiti_kernel_crypt.isa:NumVgprs             = 124;
3/_temp_0_Tahiti_kernel_final.isa:NumVgprs             = 119;
3/_temp_0_Tahiti_kernel_prepare.isa:NumVgprs             = 88;
4/_temp_0_Tahiti_kernel_crypt.isa:NumVgprs             = 119;
4/_temp_0_Tahiti_kernel_final.isa:NumVgprs             = 118;
4/_temp_0_Tahiti_kernel_prepare.isa:NumVgprs             = 87;
5/_temp_0_Tahiti_kernel_crypt.isa:NumVgprs             = 119;
5/_temp_0_Tahiti_kernel_final.isa:NumVgprs             = 118;
5/_temp_0_Tahiti_kernel_prepare.isa:NumVgprs             = 87;

There was a spike in NumVgprs for the mixed version (with bitalign only
used for half of the rotates), yet the speed was good.

ScratchSize remained the same as yours in all of these revisions.

1/_temp_0_Tahiti_kernel_crypt.isa:ScratchSize          = 32 dwords/thread;
1/_temp_0_Tahiti_kernel_final.isa:ScratchSize          = 32 dwords/thread;
1/_temp_0_Tahiti_kernel_prepare.isa:ScratchSize          = 108 dwords/thread;

Alexander
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.