|
Date: Sat, 10 Oct 2015 07:52:06 +0300 From: Solar Designer <solar@...nwall.com> To: john-dev@...ts.openwall.com Subject: 64-bit rotate on AMD GCN Claudio, magnum, Agnieszka - I've just tried out Claudio's latest sha512crypt-opencl commits - good improvement on GCN, thanks! Most of the time, I got these -test speeds on super's -dev=2 (Tahiti 1050 MHz, Catalyst 15.7): Speed for cost 1 (iteration count) of 5000 Raw: 39125 c/s real, 1456K c/s virtual Speed for cost 1 (iteration count) of 5000 Raw: 38325 c/s real, 728177 c/s virtual On some rare occasions, I got: Speed for cost 1 (iteration count) of 5000 Raw: 55538 c/s real, 1456K c/s virtual even though the speeds reported by -v=4 during auto-tuning never went above 41k. It's weird. Then I noticed that opencl_sha512.h uses: #define ror(x, n) ((x >> n) | (x << (64UL-n))) In the generated ISA code, there are no v_alignbit_b32 instructions. So I tried to use rotate(): #define ror(x, n) rotate(x, 64UL-n) and speeds went up (except for the very rare 55k seen above), e.g. on four consecutive tests: Speed for cost 1 (iteration count) of 5000 Raw: 44734 c/s real, 1310K c/s virtual Speed for cost 1 (iteration count) of 5000 Raw: 48188 c/s real, 728177 c/s virtual Speed for cost 1 (iteration count) of 5000 Raw: 43690 c/s real, 728177 c/s virtual Speed for cost 1 (iteration count) of 5000 Raw: 44887 c/s real, 2383K c/s virtual The ISA changed, but still did not use v_alignbit_b32 instructions. Notably, while all IL sizes and the ISA sizes for kernel_crypt and kernel_final remained unchanged (although the instructions and VGPR usage changed), the ISA size for kernel_prepare reduced: -codeLenInByte = 163080 bytes; +codeLenInByte = 162504 bytes; (It's huge either way, though.) Next I tried to use amd_bitalign() explicitly, initially for a half of the rotates: #define ror(x, n) ((n) < 32 ? (amd_bitalign((uint)((x) >> 32), (uint)(x), (uint)(n)) | ((ulong)amd_bitalign((uint)(x), (uint)((x) >> 32), (uint)(n)) << 32)) : rotate((x), 64UL - (n))) IL sizes went way up, ISA sizes significantly down (for all 3 kernels), and speed went up: Speed for cost 1 (iteration count) of 5000 Raw: 60124 c/s real, 2621K c/s virtual Speed for cost 1 (iteration count) of 5000 Raw: 54841 c/s real, 2621K c/s virtual Of course, there were now v_alignbit_b32 instructions in the generated ISA code. I ran only these two benchmarks with this code revision before proceeding further, to: #define ror(x, n) ((n) < 32 ? (amd_bitalign((uint)((x) >> 32), (uint)(x), (uint)(n)) | ((ulong)amd_bitalign((uint)(x), (uint)((x) >> 32), (uint)(n)) << 32)) : (amd_bitalign((uint)(x), (uint)((x) >> 32), (uint)(n) - 32) | ((ulong)amd_bitalign((uint)((x) >> 32), (uint)(x), (uint)(n) - 32) << 32))) IL went further up, ISA further down, speeds went in unclear direction: Speed for cost 1 (iteration count) of 5000 Raw: 54386 c/s real, 2912K c/s virtual Speed for cost 1 (iteration count) of 5000 Raw: 54841 c/s real, 1456K c/s virtual Speed for cost 1 (iteration count) of 5000 Raw: 54161 c/s real, 819200 c/s virtual Speed for cost 1 (iteration count) of 5000 Raw: 53718 c/s real, 728177 c/s virtual even though the speeds reported during auto-tuning went up (now above 62k is reported during auto-tuning, only to be followed with the 54k'ish speeds on the final benchmark). For example: [solar@...er run]$ AMD_OCL_BUILD_OPTIONS_APPEND=-save-temps ./john -test -form=sha512crypt-opencl -dev=2 -v=4 Benchmarking: sha512crypt-opencl, crypt(3) $6$ (rounds=5000) [SHA512 OpenCL]... Device 2: Tahiti [AMD Radeon HD 7900 Series] Calculating best GWS for LWS=64; max. 150ms single kernel invocation. gws: 2048 5158 25790000 rounds/s 397.002ms per crypt_all()! gws: 4096 14338 71690000 rounds/s 285.657ms per crypt_all()! gws: 8192 31089 155445000 rounds/s 263.499ms per crypt_all()! gws: 16384 59464 297320000 rounds/s 275.524ms per crypt_all()+ gws: 32768 59717 298585000 rounds/s 548.714ms per crypt_all() gws: 65536 61746 308730000 rounds/s 1.061s per crypt_all()+ gws: 131072 62269 311345000 rounds/s 2.104s per crypt_all() Calculating best LWS for GWS=65536 Testing LWS=64 GWS=65536 ... 43.049ms+ Testing LWS=128 GWS=65536 ... 45.408ms Testing LWS=192 GWS=65472 ... 61.590ms Testing LWS=256 GWS=65536 ... 42.687ms+ Calculating best GWS for LWS=256; max. 300ms single kernel invocation. gws: 8192 31530 157650000 rounds/s 259.809ms per crypt_all()! gws: 16384 58888 294440000 rounds/s 278.222ms per crypt_all()+ gws: 32768 60613 303065000 rounds/s 540.605ms per crypt_all()+ gws: 65536 61189 305945000 rounds/s 1.071s per crypt_all() gws: 131072 61748 308740000 rounds/s 2.122s per crypt_all()+ gws: 262144 63347 316735000 rounds/s 4.138s per crypt_all()+ Local worksize (LWS) 256, global worksize (GWS) 262144 DONE Speed for cost 1 (iteration count) of 5000 Raw: 55072 c/s real, 2621K c/s virtual Oh, 63k during the auto-tuning even. Was that extrapolated from fewer than 5000 iterations maybe? Is the instruction cache hit rate worse for 5000 iterations maybe? We could want to play with how we're splitting the 5000 iterations across kernel invocations to hopefully regain this speed for actual runs. ... or maybe we have it for actual runs already: [solar@...er run]$ ./john -form=sha512crypt-opencl -dev=2 -v=5 -inc=alpha -min-len=8 -max-len=8 pw [...] Local worksize (LWS) 64, global worksize (GWS) 262144 [...] 0g 0:00:00:00 0g/s 0p/s 0c/s 0C/s 0g 0:00:00:16 0g/s 48907p/s 48907c/s 48907C/s bigetort..soyfryap 0g 0:00:00:17 0g/s 46044p/s 46044c/s 46044C/s bigetort..soyfryap 0g 0:00:00:18 0g/s 58060p/s 58060c/s 58060C/s soyfryam..calefryn 0g 0:00:00:19 0g/s 54985p/s 54985c/s 54985C/s soyfryam..calefryn 0g 0:00:00:20 0g/s 52245p/s 52245c/s 52245C/s soyfryam..calefryn 0g 0:00:00:21 0g/s 49742p/s 49742c/s 49742C/s soyfryam..calefryn 0g 0:00:00:23 0g/s 56839p/s 56839c/s 56839C/s calefrya..astefeto 0g 0:00:00:33 0g/s 55505p/s 55505c/s 55505C/s chumaist..metalito 0g 0:00:00:34 0g/s 53859p/s 53859c/s 53859C/s chumaist..metalito 0g 0:00:00:38 0g/s 55101p/s 55101c/s 55101C/s metality..singuapp 0g 0:00:00:39 0g/s 60386p/s 60386c/s 60386C/s singuazy..abbortom 0g 0:00:00:40 0g/s 58923p/s 58923c/s 58923C/s singuazy..abbortom 0g 0:00:00:41 0g/s 57473p/s 57473c/s 57473C/s singuazy..abbortom 0g 0:00:00:42 0g/s 56093p/s 56093c/s 56093C/s singuazy..abbortom 0g 0:00:00:43 0g/s 60864p/s 60864c/s 60864C/s abbortot..mcmyleow abcdefgh (?) 1g 0:00:00:47 DONE (2015-10-10 07:24) 0.02114g/s 60976p/s 60976c/s 60976C/s abbortot..mcmyleow 61k average for a 47 seconds run, not bad. Maybe this includes a not fully processed last buffer (4 or 5 seconds), though? As a final experiment, I tried omitting the "- 32" from my uses of amd_bitalign() that had it, because amd_bitalign() is defined (unlike e.g. bitwise shifts in C) to take its shift count "& 31": https://www.khronos.org/registry/cl/extensions/amd/cl_amd_media_ops.txt Specifically: #define ror(x, n) ((n) < 32 ? (amd_bitalign((uint)((x) >> 32), (uint)(x), (uint)(n)) | ((ulong)amd_bitalign((uint)(x), (uint)((x) >> 32), (uint)(n)) << 32)) : (amd_bitalign((uint)(x), (uint)((x) >> 32), (uint)(n)/* - 32*/) | ((ulong)amd_bitalign((uint)((x) >> 32), (uint)(x), (uint)(n)/* - 32*/) << 32))) Notice two commented out "- 32"'s. The resulting ISA size remained the same, and speeds too, but the changed shift counts (now some above 31) are actually encoded in the instructions. It's curious that GCN even has room in the instructions to encode those redundant shift counts, instead of applying the "& 31" at compile time. I think it's better to use the previous version, with the "- 32"'s intact, as long as the rotate counts are compile-time constants, so this subtraction is performed at compile time as well. Further speedup might be possible through switching the SWAP64() macro from using rotate() to using the approach above. However, it is currently used on both vector and scalar arguments, so my trivial attempt at changing it like that failed. I guess we'd need to split it into two macros: one for vectors and one for scalars. I am leaving this for Claudio or/and magnum to experiment with. Agnieszka - the 64-bit rotate optimizations discussed above are probably also applicable to Argon2 and Lyra2. Claudio - some additional info on the experiments above, in the same order (first is your code, then my revisions as described above): [solar@...er run]$ fgrep codeLenInByte ?/*.isa 1/_temp_0_Tahiti_kernel_crypt.isa:codeLenInByte = 59044 bytes; 1/_temp_0_Tahiti_kernel_final.isa:codeLenInByte = 60240 bytes; 1/_temp_0_Tahiti_kernel_prepare.isa:codeLenInByte = 163080 bytes; 2/_temp_0_Tahiti_kernel_crypt.isa:codeLenInByte = 59044 bytes; 2/_temp_0_Tahiti_kernel_final.isa:codeLenInByte = 60240 bytes; 2/_temp_0_Tahiti_kernel_prepare.isa:codeLenInByte = 162504 bytes; 3/_temp_0_Tahiti_kernel_crypt.isa:codeLenInByte = 59164 bytes; 3/_temp_0_Tahiti_kernel_final.isa:codeLenInByte = 60328 bytes; 3/_temp_0_Tahiti_kernel_prepare.isa:codeLenInByte = 158704 bytes; 4/_temp_0_Tahiti_kernel_crypt.isa:codeLenInByte = 59148 bytes; 4/_temp_0_Tahiti_kernel_final.isa:codeLenInByte = 60316 bytes; 4/_temp_0_Tahiti_kernel_prepare.isa:codeLenInByte = 158616 bytes; 5/_temp_0_Tahiti_kernel_crypt.isa:codeLenInByte = 59148 bytes; 5/_temp_0_Tahiti_kernel_final.isa:codeLenInByte = 60316 bytes; 5/_temp_0_Tahiti_kernel_prepare.isa:codeLenInByte = 158616 bytes; [solar@...er run]$ fgrep NumVgpr ?/*.isa 1/_temp_0_Tahiti_kernel_crypt.isa:NumVgprs = 116; 1/_temp_0_Tahiti_kernel_final.isa:NumVgprs = 116; 1/_temp_0_Tahiti_kernel_prepare.isa:NumVgprs = 90; 2/_temp_0_Tahiti_kernel_crypt.isa:NumVgprs = 115; 2/_temp_0_Tahiti_kernel_final.isa:NumVgprs = 115; 2/_temp_0_Tahiti_kernel_prepare.isa:NumVgprs = 91; 3/_temp_0_Tahiti_kernel_crypt.isa:NumVgprs = 124; 3/_temp_0_Tahiti_kernel_final.isa:NumVgprs = 119; 3/_temp_0_Tahiti_kernel_prepare.isa:NumVgprs = 88; 4/_temp_0_Tahiti_kernel_crypt.isa:NumVgprs = 119; 4/_temp_0_Tahiti_kernel_final.isa:NumVgprs = 118; 4/_temp_0_Tahiti_kernel_prepare.isa:NumVgprs = 87; 5/_temp_0_Tahiti_kernel_crypt.isa:NumVgprs = 119; 5/_temp_0_Tahiti_kernel_final.isa:NumVgprs = 118; 5/_temp_0_Tahiti_kernel_prepare.isa:NumVgprs = 87; There was a spike in NumVgprs for the mixed version (with bitalign only used for half of the rotates), yet the speed was good. ScratchSize remained the same as yours in all of these revisions. 1/_temp_0_Tahiti_kernel_crypt.isa:ScratchSize = 32 dwords/thread; 1/_temp_0_Tahiti_kernel_final.isa:ScratchSize = 32 dwords/thread; 1/_temp_0_Tahiti_kernel_prepare.isa:ScratchSize = 108 dwords/thread; Alexander
Powered by blists - more mailing lists
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.