john-dev - bcrypt-opencl local vs. private memory

Message-ID: <20150507195226.GA15044@openwall.com>
Date: Thu, 7 May 2015 22:52:26 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: bcrypt-opencl local vs. private memory

Sayantan, magnum -

I just realized that our bcrypt-opencl's bf_kernel.cl is using local
rather than private memory on all GPUs.  While this is right for AMD, it
might not be right for NVIDIA.  Here's what I am getting with unchanged
bf_kernel.cl on super's GTX TITAN:

$ ./john -te -form=bcrypt-opencl -dev=5
Device 5: GeForce GTX TITAN
Local worksize (LWS) 8, Global worksize (GWS) 128
Benchmarking: bcrypt-opencl ("$2a$05", 32 iterations) [Blowfish OpenCL]... DONE
Speed for cost 1 (iteration count) of 32
Raw:    487 c/s real, 487 c/s virtual

$ GWS=1024 ./john -te -form=bcrypt-opencl -dev=5
Device 5: GeForce GTX TITAN
Local worksize (LWS) 8, Global worksize (GWS) 1024
Benchmarking: bcrypt-opencl ("$2a$05", 32 iterations) [Blowfish OpenCL]... DONE
Speed for cost 1 (iteration count) of 32
Raw:    781 c/s real, 787 c/s virtual

BTW, it's unclear why the auto-tuning didn't go higher than GWS=128.
Here's the best speed we got for it before (IIRC, also with manual GWS):

http://www.openwall.com/presentations/Passwords14-Energy-Efficient-Cracking/slide-45.html

This says 813 c/s.

Changing "#define MAYBE_LOCAL" in bf_kernel.cl from __local to
__private, I got:

$ ./john -te -form=bcrypt-opencl -dev=5
Device 5: GeForce GTX TITAN
Local worksize (LWS) 8, Global worksize (GWS) 1024
Benchmarking: bcrypt-opencl ("$2a$05", 32 iterations) [Blowfish OpenCL]... DONE
Speed for cost 1 (iteration count) of 32
Raw:    860 c/s real, 853 c/s virtual

With this change, auto-tuning picks GWS=1024 right away, and the speed
is about 10% better than the 781 c/s above.  BTW, GWS=2048 fails:

$ GWS=2048 ./john -te -form=bcrypt-opencl -dev=5
Device 5: GeForce GTX TITAN
Local worksize (LWS) 8, Global worksize (GWS) 2048
Benchmarking: bcrypt-opencl ("$2a$05", 32 iterations) [Blowfish OpenCL]... Segmentation fault

I think we should test on other NVIDIA cards (such as the GTX 570 in
bull) and maybe make __private the default for NVIDIA.  It may also make
sense to place some of the S-boxes in local and some in private.  Maybe
this will result in a higher optimal GWS.
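
To put rough numbers on the idea of splitting the S-boxes, here is a
minimal sketch of the local-memory arithmetic.  It assumes the standard
Blowfish layout (four S-boxes of 256 32-bit words each, so 4 KiB per
bcrypt instance) and that every work-item keeps its own copy.  The
helper name local_bytes_per_group is hypothetical and is not taken from
bf_kernel.cl.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Standard Blowfish: 4 S-boxes, 256 entries each, 32 bits per entry. */
#define BF_SBOX_COUNT   4
#define BF_SBOX_ENTRIES 256

/*
 * Hypothetical helper (not from bf_kernel.cl): bytes of local memory
 * one work-group needs when each of its `lws` work-items keeps
 * `sboxes_in_local` of its S-boxes in local memory and the remaining
 * ones in private memory.
 */
static size_t local_bytes_per_group(size_t lws, unsigned sboxes_in_local)
{
	assert(sboxes_in_local <= BF_SBOX_COUNT);
	return lws * sboxes_in_local * BF_SBOX_ENTRIES * sizeof(uint32_t);
}
```

For example, with LWS 8 as in the TITAN runs, all four S-boxes in local
memory come to 8 * 4096 = 32768 bytes per work-group, and moving two of
them to private would halve that to 16384.  The result could then be
compared against the device's CL_DEVICE_LOCAL_MEM_SIZE to see how many
work-groups might fit at once.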

Will you take this task from here, please?

BTW, on AMD the same switch to __private results in a huge slowdown,
about 5.5x (775 vs. 4231 c/s).  With local:

$ ./john -te -form=bcrypt-opencl -dev=1
Device 1: Tahiti [AMD Radeon HD 7900 Series]
Local worksize (LWS) 4, Global worksize (GWS) 1024
Benchmarking: bcrypt-opencl ("$2a$05", 32 iterations) [Blowfish OpenCL]... DONE
Speed for cost 1 (iteration count) of 32
Raw:    4231 c/s real, 512000 c/s virtual

With private:

$ ./john -te -form=bcrypt-opencl -dev=1
Device 1: Tahiti [AMD Radeon HD 7900 Series]
Local worksize (LWS) 4, Global worksize (GWS) 512
Benchmarking: bcrypt-opencl ("$2a$05", 32 iterations) [Blowfish OpenCL]... DONE
Speed for cost 1 (iteration count) of 32
Raw:    775 c/s real, 102400 c/s virtual

$ GWS=1024 ./john -te -form=bcrypt-opencl -dev=1
Device 1: Tahiti [AMD Radeon HD 7900 Series]
Local worksize (LWS) 4, Global worksize (GWS) 1024
Benchmarking: bcrypt-opencl ("$2a$05", 32 iterations) [Blowfish OpenCL]... DONE
Speed for cost 1 (iteration count) of 32
Raw:    775 c/s real, 102400 c/s virtual
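
Given that __private wins on the TITAN while __local wins by a wide
margin on Tahiti, one possible way to pick the qualifier per device is
sketched below.  This is only an illustration under stated assumptions:
both function names are hypothetical, the vendor string is assumed to
come from clGetDeviceInfo() with CL_DEVICE_VENDOR, and bf_kernel.cl is
assumed to wrap its own definition in "#ifndef MAYBE_LOCAL" so that a
-D build option can override it.  Treating every NVIDIA device alike is
also an assumption until other cards are tested.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of John the Ripper): choose the
 * address-space qualifier for the S-boxes from the OpenCL vendor
 * string.  The choice reflects only the two benchmarks in this message.
 */
static const char *sbox_qualifier_for_vendor(const char *vendor)
{
	assert(vendor != NULL);
	if (strstr(vendor, "NVIDIA") != NULL)
		return "__private";
	return "__local"; /* current default; much faster on Tahiti */
}

/*
 * Format a kernel build option such as "-DMAYBE_LOCAL=__private".
 * Returns 0 on success, -1 if the output buffer is too small.
 */
static int sbox_build_option(const char *vendor, char *out, size_t out_size)
{
	int n = snprintf(out, out_size, "-DMAYBE_LOCAL=%s",
	    sbox_qualifier_for_vendor(vendor));

	return (n < 0 || (size_t)n >= out_size) ? -1 : 0;
}
```

The resulting string would be appended to the options passed to
clBuildProgram().  A finer-grained rule (per GPU generation, or a mixed
local/private layout) could replace the vendor check once more cards
have been benchmarked.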

Alexander