Date: Mon, 22 Jun 2015 21:20:51 +0200
From: magnum <john.magnum@...hmail.com>
To: john-dev@...ts.openwall.com
Subject: Re: bcrypt-opencl local vs. private memory

On 2015-06-22 05:49, Solar Designer wrote:
> On Sun, Jun 21, 2015 at 01:30:52AM +0200, magnum wrote:
>> On 2015-06-20 23:04, Solar Designer wrote:
>>> magnum, can we possibly have
>>> this local vs. private bit autodetected along with GWS and LWS?
>>
>> Well the bcrypt format could do so. That would be for Sayantan to
>> implement. However, I just committed a workaround for now, simply using
>> nvidia_sm_5x() instead of gpu_nvidia().
>
> This is based on testing on your Maxwell card?  What speeds are you
> getting for local vs. private memory there?  And what card is that?

I was confused; I had the idea your Titan was somehow sm_5x despite not 
being Maxwell. But more on Maxwell below.

>> BTW for my Kepler GPU, I see no difference between using local or private.
>
> Note that I initially pointed this out for a Kepler - the TITAN that we
> have in super:
>
> http://www.openwall.com/lists/john-dev/2015/05/07/36

It seems I screwed up (again) when checking that. My little toy Kepler 
is indeed faster using private. Unfortunately the nvidia_sm* macros 
don't work on OSX (they depend on proprietary OpenCL extensions that 
Apple doesn't include even for its nvidia drivers).

> So maybe the check should be:
>
> #if nvidia_sm_3x(DEVICE_INFO) || nvidia_sm_5x(DEVICE_INFO)

Actually only sm_3x. I tested this on a Titan X today and local is much 
better there:

Using private:
Device 0: GeForce GTX TITAN X
Local worksize (LWS) 8, Global worksize (GWS) 2048
Benchmarking: bcrypt-opencl ("$2a$05", 32 iterations) [Blowfish OpenCL]... DONE
Speed for cost 1 (iteration count) of 32
Raw:    790 c/s real, 787 c/s virtual

Using local:
Device 0: GeForce GTX TITAN X
Local worksize (LWS) 8, Global worksize (GWS) 4096
Benchmarking: bcrypt-opencl ("$2a$05", 32 iterations) [Blowfish OpenCL]... DONE
Speed for cost 1 (iteration count) of 32
Raw:    5354 c/s real, 5319 c/s virtual
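
In other words, only the sm_3x part of that suggested check should pick 
private memory. As a rough sketch (the BCRYPT_SBOX_IN_PRIVATE name is 
made up here just for illustration; nvidia_sm_3x() is the existing 
helper macro quoted above), the build-time choice would be:

#if nvidia_sm_3x(DEVICE_INFO)
/* Kepler: private copies of the S-boxes benchmark faster */
#define BCRYPT_SBOX_IN_PRIVATE 1
#else
/* Maxwell (sm_5x) and everything else: local memory wins, see above */
#define BCRYPT_SBOX_IN_PRIVATE 0
#endif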

BTW I tested oclHashcat too and it does 11570 c/s; we don't even do 
half of that :-/

Anyway, I have now committed a proper change (sm_3x gets private, all 
others get local). I may try to find a workaround for OSX detection some 
rainy day. For example, if CUDA is enabled we could fall back to CUDA 
queries for that.
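
To sketch that idea (nothing committed yet; the function below and its 
name are made up for illustration), the CUDA runtime reports the 
compute capability directly, so the fallback could look roughly like:

#include <cuda_runtime.h>

/*
 * Hypothetical host-side fallback for OSX, where the OpenCL
 * cl_nv_device_attribute_query extension is unavailable: ask the CUDA
 * runtime for the SM version instead. Mapping the OpenCL device to
 * the matching CUDA device index is left out of this sketch.
 */
static int get_sm_major_via_cuda(int cuda_device)
{
	struct cudaDeviceProp prop;

	if (cudaGetDeviceProperties(&prop, cuda_device) != cudaSuccess)
		return -1;  /* unknown; caller keeps the default (local) path */
	return prop.major;  /* 3 = Kepler (sm_3x), 5 = Maxwell (sm_5x) */
}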

magnum

