Date: Tue, 24 Jul 2012 22:33:46 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: Result of hard core password generation on 7970

myrice -

By "hard core", I guess you mean "hard-coded"? ;-)

On Tue, Jul 24, 2012 at 08:52:36PM +0800, myrice wrote:
> raw-md5-opencl with PG
> 1
> guesses: 0 time: 0:00:04:52 0.00% c/s: 196758K

This is about 20 times slower than the target speed, but it is several
times faster than speed on a single CPU core. So you should proceed to
implement/tune this further.

> 1000
> guesses: 2 time: 0:00:01:44 0.00% c/s: 30660M

Roughly same speed as we had before, and as we get on one CPU core.
Not good at all.

> 1000000
> Unable to get, will reduce the max_keys_per_crypt number

Yes, as discussed on IRC your max_keys_per_crypt was unreasonably high
(left over from the previous implementation, which did not generate
additional candidate passwords on its own).

> or divided it
> to 1000s kernel's run(every run with fixed 1000 loaded hashes), so
> loaded hashes could take advantages of constant memory.

That's a bad idea. With saltless hashes (and with many hashes per
salt), it is crucial to only hash each candidate password once, then
compare against all loaded hashes (indirectly if their number is
large). By processing the loaded hashes in smaller chunks, you give up
the advantage from their saltlessness (or same-saltness).

> raw-md5-opencl origin
> 1
> guesses: 0 time: 0:00:06:42 0.00% c/s: 27852K
>
> 1000
> guesses: 0 time: 0:00:06:35 0.00% c/s: 23965M
>
> 1M
> guesses: 0 time: 0:00:09:14 0.00% c/s: 17087G
>
>
> raw-md5
> 1
> guesses: 0 time: 0:00:05:01 0.00% c/s: 25639K
>
> 1000
> guesses: 0 time: 0:00:05:13 0.00% c/s: 23262M
>
> 1M
> guesses: 0 time: 0:00:02:46 0.00% c/s: 16777G

Yes, we had roughly the same speed on one CPU core vs. GPU for raw-md5
before your changes. CPU + RAM was the bottleneck in either case.
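
The c/s figures quoted above grow with the number of loaded hashes because
they count combinations of candidate password and loaded hash, so dividing
by the loaded hash count gives the number of hash computations per second.
A minimal sketch of that arithmetic in C, where the function name
hashes_per_sec is made up for illustration and is not part of John the
Ripper:

```c
#include <assert.h>

/*
 * Convert a "c/s" figure (candidate/hash combinations per second) into
 * hash computations per second by dividing by the number of loaded hashes.
 * hashes_per_sec is a hypothetical helper name used only in this sketch.
 */
static double hashes_per_sec(double combos_per_sec, double loaded_hashes)
{
    assert(loaded_hashes > 0.0);
    return combos_per_sec / loaded_hashes;
}

/*
 * raw-md5 on one CPU core, using the figures from the quoted output:
 *   hashes_per_sec(25639e3, 1.0)   is about 25.6 million per second
 *   hashes_per_sec(23262e6, 1e3)   is about 23.3 million per second
 *   hashes_per_sec(16777e9, 1e6)   is about 16.8 million per second
 */
```

Applying the same division to the "with PG" runs gives about 196.8M for one
loaded hash and about 30.7M for 1000 loaded hashes.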
Also, as you can see the bitmap worked quite well on CPU, where the
number of hash computations per second reduced from 25.6M to 23.3M and
then to 16.8M - similar numbers, even though the number of loaded
hashes changed by a factor of 1 million.

> The struct of kernel is as follows
> for i in [a-zA-Z]
>   for j in [a-zA-Z]
>     add [a-zA-Z][a-zA-Z] to key
>     MD5_Hash
>     for k in loaded_hashes
>       compare computed hash with loaded_hashes[k]
>     endfor k
>   endfor j
> endfor i

Yes, but the hash comparisons must be indirect when the number of
loaded hashes is non-trivial.

> So I guess the compared work takes so much time. I just changed
> __global loaded_hashes to __constant loaded_hashes, it runs correct on
> GTX570 and CPU, but on 7970, it causes segmentation fault.

I think it may be fine to keep the loaded hashes themselves in global
memory (in fact, you have to when there's a lot of them), but you may
try to fit the bitmap in local memory when you reasonably can. Only
upgrade it to a larger size (not fitting in local memory) when that is
beneficial overall (you'll have to tune this threshold). You'll need
to include code for at least two bitmap sizes (although on CPU we
currently support 7 of them) and pick one that is expected to work
faster given the actual number of loaded hashes (make this decision at
runtime).

Thanks,

Alexander
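
The indirect comparison and the runtime choice between two bitmap sizes
described above could look roughly like the following host-side C sketch.
It is not the actual John the Ripper code: the type and function names, the
two bitmap sizes, and the threshold are all placeholders picked for
illustration, and the linear scan after a bitmap hit stands in for whatever
hash table a real implementation would use.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Placeholder sizes: 2^16 bits (8 KiB) is meant to stand for a bitmap small
 * enough for local memory, 2^24 bits (2 MiB) for one kept in global memory. */
#define BITMAP_BITS_SMALL (1u << 16)
#define BITMAP_BITS_LARGE (1u << 24)
/* Placeholder threshold; the real value has to be found by tuning. */
#define LARGE_THRESHOLD 4096u

typedef struct { uint32_t w[4]; } md5_hash_t;   /* 128-bit MD5 result */

typedef struct {
    uint32_t *bits;   /* one bit per bitmap slot */
    uint32_t mask;    /* number of slots minus 1 (slots are a power of 2) */
} bitmap_t;

/* Decide the bitmap size at runtime from the number of loaded hashes. */
static uint32_t pick_bitmap_bits(size_t loaded)
{
    return loaded < LARGE_THRESHOLD ? BITMAP_BITS_SMALL : BITMAP_BITS_LARGE;
}

/* Build the bitmap: set the bit selected by the first word of each hash. */
static int bitmap_init(bitmap_t *bm, const md5_hash_t *hashes, size_t n)
{
    uint32_t nbits = pick_bitmap_bits(n);
    assert((nbits & (nbits - 1)) == 0);
    bm->bits = calloc(nbits / 32, sizeof(uint32_t));
    if (!bm->bits)
        return -1;
    bm->mask = nbits - 1;
    for (size_t i = 0; i < n; i++) {
        uint32_t idx = hashes[i].w[0] & bm->mask;
        bm->bits[idx >> 5] |= 1u << (idx & 31);
    }
    return 0;
}

/* Cheap pre-filter: 0 means the hash is definitely not loaded. */
static int bitmap_maybe_present(const bitmap_t *bm, const md5_hash_t *h)
{
    uint32_t idx = h->w[0] & bm->mask;
    return (int)((bm->bits[idx >> 5] >> (idx & 31)) & 1u);
}

/* Compare one computed hash against all loaded hashes indirectly:
 * almost every candidate is rejected by a single bitmap lookup. */
static int find_hash(const bitmap_t *bm, const md5_hash_t *hashes, size_t n,
                     const md5_hash_t *h)
{
    if (!bitmap_maybe_present(bm, h))
        return -1;
    for (size_t i = 0; i < n; i++)   /* simplification: rare slow path */
        if (memcmp(&hashes[i], h, sizeof(*h)) == 0)
            return (int)i;
    return -1;
}
```

In a kernel, find_hash would be called once per computed hash inside the
innermost candidate loop, in place of the loop over loaded_hashes, so each
candidate is still hashed only once regardless of how many hashes are loaded.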