Date: Fri, 22 Mar 2013 23:36:40 +0100
From: magnum <john.magnum@...hmail.com>
To: john-dev@...ts.openwall.com
Subject: Re: Idea to increase plaintext length for GPU based hashes

On 19 Mar, 2013, at 2:52 , Brian Wallace <bwall@...nbwall.com> wrote:
> On 03/18/2013 09:21 PM, magnum wrote:
>> I've had similar thoughts. For some reason I rejected the idea so far because I did not think it would fly IRL. I'm not sure how I came to that conclusion; we should definitely try it. The greatest obstacle might be set_key() performance - this function is a bottleneck already. Maybe it won't get much slower though?
> 
> I'm not sure if it would be slower.  If a pointer is maintained to the position of the last character added to a (max keys * max size) buffer, as well an integer defining the offset, we can do the writes with little computation.  Since the size of the PointerObject would be static, a single buffer could be used, where the data is put after where the last PointerObject would fill.

I tried this for raw-md5-opencl in bleeding. It was easy and the results were so good I have committed it, although again I had hoped for a little more.

I use a single int array for the index, with the low 6 bits holding the length (we are limited to 55 anyway for a single MD5 block) and the high 26 bits the index. Keys are int-aligned and the index counts ints. This works out to a maximum allowed work size of 4793490, which is more than the Tahiti wants anyway.

Before this, with max length 32:

Device 0: GeForce GTX 570
Local worksize (LWS) 128, global worksize (GWS) 2097152
Benchmarking: Raw MD5 [OpenCL (inefficient, development use only)]... DONE
Raw:    40887K c/s real, 41146K c/s virtual

Device 1: Tahiti (AMD Radeon HD 7900 Series)
Local worksize (LWS) 128, global worksize (GWS) 4194304
Benchmarking: Raw MD5 [OpenCL (inefficient, development use only)]... DONE
Raw:    29399K c/s real, 48770K c/s virtual

(I have a faint memory of Solar mentioning that the Tahiti gets fewer PCIe lanes than the nvidia in this setup.)

New figures, with max length 55:

Device 0: GeForce GTX 570 
Local worksize (LWS) 128, global worksize (GWS) 4194304
Benchmarking: Raw MD5 [OpenCL (inefficient, development use only)]... DONE
Raw:    51569K c/s real, 51569K c/s virtual

Device 1: Tahiti (AMD Radeon HD 7900 Series)
Local worksize (LWS) 128, global worksize (GWS) 4194304
Benchmarking: Raw MD5 [OpenCL (inefficient, development use only)]... DONE
Raw:    48395K c/s real, 63550K c/s virtual

...and this despite my having re-added some test vectors, so the latter run uses increased (and diverging) lengths in the benchmark.


BTW, here are my laptop's figures; they really show how bad the above still is:

Device 1: GeForce GT 650M 
Local worksize (LWS) 128, global worksize (GWS) 4793472
Benchmarking: Raw MD5 [OpenCL (inefficient, development use only)]... DONE
Raw:	48822K c/s real, 75325K c/s virtual


If nothing else, it's a very good thing we support these large lengths with no downside. I tried to profile it with Valgrind and it looks like much of the CPU time is spent in strlen(). So I tried another version of set_key() that copies a byte at a time (instead of an int at a time) but doesn't need strlen(), but it ended up slower. I still think we can cram some more out of this, on the host side.

Sure, the "real fix" is key generation or masks (or better, rules!) on GPU. But that does not apply to all use cases. We should definitely do this to more formats. NTLMv2 is my next one.

magnum
