Date: Tue, 24 Jul 2012 22:33:46 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: Result of hard core password generation on 7970

myrice -

By "hard core", I guess you mean "hard-coded"? ;-)

On Tue, Jul 24, 2012 at 08:52:36PM +0800, myrice wrote:
> raw-md5-opencl with PG
> 1
> guesses: 0  time: 0:00:04:52 0.00%  c/s: 196758K

This is about 20 times slower than the target speed, but it is several
times faster than the speed on a single CPU core.  So you should proceed
to implement/tune this further.

> 1000
> guesses: 2  time: 0:00:01:44 0.00%  c/s: 30660M

Roughly the same speed as we had before, and as we get on one CPU core.
Not good at all.

> 1000000
> Unable to get, will reduce the max_keys_per_crypt number

Yes, as discussed on IRC your max_keys_per_crypt was unreasonably high
(left over from the previous implementation, which did not generate
additional candidate passwords on its own).

> or divided it
> into 1000 kernel runs (each run with a fixed 1000 loaded hashes), so
> the loaded hashes could take advantage of constant memory.

That's a bad idea.  With saltless hashes (and with many hashes per
salt), it is crucial to only hash each candidate password once, then
compare against all loaded hashes (indirectly if their number is
large).  By processing the loaded hashes in smaller chunks, you give up
the advantage from their saltlessness (or same-saltness).
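
To illustrate the single-pass approach, here's a rough sketch (the names
BITMAP_BITS, md5_once() and lookup_hash_table() are made up for this
example, not the actual john code):

#include <stdint.h>

#define BITMAP_BITS (1 << 20)	/* bitmap size in bits (tunable) */

extern uint32_t bitmap[BITMAP_BITS / 32];	/* bits set from loaded hashes */

void md5_once(const char *key, uint32_t hash[4]);	/* hypothetical helpers */
void lookup_hash_table(const uint32_t hash[4]);

void crack_all(char **candidates, unsigned int count)
{
	uint32_t hash[4], b;
	unsigned int i;

	for (i = 0; i < count; i++) {
		/* one MD5 per candidate, no matter how many hashes are loaded */
		md5_once(candidates[i], hash);
		b = hash[0] & (BITMAP_BITS - 1);	/* index by low hash bits */
		if (bitmap[b >> 5] & (1U << (b & 31)))	/* almost always false */
			lookup_hash_table(hash);	/* rare: full comparison */
	}
}

Splitting the loaded hashes into chunks of 1000 would turn that single
md5_once() call into one per chunk per candidate, which is exactly the
extra work we're trying to avoid.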

> raw-md5-opencl origin
> 1
> guesses: 0  time: 0:00:06:42 0.00%  c/s: 27852K
> 
> 1000
> guesses: 0  time: 0:00:06:35 0.00%  c/s: 23965M
> 
> 1M
> guesses: 0  time: 0:00:09:14 0.00%  c/s: 17087G
> 
> 
> raw-md5
> 1
> guesses: 0  time: 0:00:05:01 0.00%  c/s: 25639K
> 
> 1000
> guesses: 0  time: 0:00:05:13 0.00%  c/s: 23262M
> 
> 1M
> guesses: 0  time: 0:00:02:46 0.00%  c/s: 16777G

Yes, we had roughly the same speed on one CPU core vs. GPU for raw-md5
before your changes.  CPU + RAM was the bottleneck in either case.

Also, as you can see, the bitmap worked quite well on the CPU: the number
of hash computations per second only dropped from 25.6M to 23.3M and then
to 16.8M - similar numbers, even though the number of loaded hashes
changed by a factor of 1 million.
 
> The struct of kernel is as follows
> for i in [a-zA-Z]
>   for j in [a-zA-Z]
>      add [a-zA-Z][a-zA-Z] to key
>      MD5_Hash
>      for k in loaded_hashes
>          compare computed hash with loaded_hashes[k]
>      endfor k
>   endfor j
> endfor i

Yes, but the hash comparisons must be indirect when the number of loaded
hashes is non-trivial.
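
For instance, on the GPU that inner "for k in loaded_hashes" loop could
become a single bitmap test, so the per-candidate cost no longer grows
with the number of loaded hashes.  This is only a sketch - the kernel
name, the out_* buffers and generate_and_md5() stand in for whatever
your actual code does:

#define BITMAP_BITS (1 << 20)	/* must match the host-side bitmap */

__kernel void crypt_and_cmp(__global const uchar *keys,
			    __global const uint *bitmap,	/* bit-packed */
			    __global uint *out_count,
			    __global uint *out_idx)
{
	uint gid = get_global_id(0);
	uint hash[4];

	/* existing code: append the generated characters, compute MD5 */
	generate_and_md5(keys, gid, hash);

	/* this replaces the comparison loop over all loaded hashes: */
	uint b = hash[0] & (BITMAP_BITS - 1);
	if (bitmap[b >> 5] & (1U << (b & 31)))
		/* possible match - record it and let the host (or a second
		   kernel) do the full comparison against the loaded hashes */
		out_idx[atomic_inc(out_count)] = gid;
}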

> So I guess the comparison work takes so much time.  I just changed
> __global loaded_hashes to __constant loaded_hashes; it runs correctly on
> GTX570 and CPU, but on 7970 it causes a segmentation fault.

I think it may be fine to keep the loaded hashes themselves in global
memory (in fact, you have to when there's a lot of them), but you may
try to fit the bitmap in local memory when you reasonably can.  Only
upgrade it to a larger size (not fitting in local memory) when that is
beneficial overall (you'll have to tune this threshold).  You'll need to
include code for at least two bitmap sizes (although on CPU we currently
support 7 of them) and pick one that is expected to work faster given
the actual number of loaded hashes (make this decision at runtime).
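
Host-side, the runtime choice could be as simple as the sketch below (the
sizes and the 4 bits per loaded hash ratio are just assumptions for
illustration, not tuned values):

#define SMALL_BITMAP_BITS (32 * 1024 * 8)	/* 32 KB: may fit in local memory */
#define LARGE_BITMAP_BITS (8 * 1024 * 1024 * 8)	/* 8 MB: global memory only */

static unsigned int choose_bitmap_bits(unsigned int loaded_count)
{
	/* keep at most ~1 loaded hash per 4 bitmap bits so that the bitmap
	   still rejects the vast majority of computed hashes */
	if (loaded_count <= SMALL_BITMAP_BITS / 4)
		return SMALL_BITMAP_BITS;
	return LARGE_BITMAP_BITS;
}

You'd then pick the kernel variant built for that size, since the size
determines whether the bitmap can live in __local memory at all.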

Thanks,

Alexander
