Date: Wed, 18 Apr 2012 20:38:47 +0800
From: myrice <qqlddg@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: Weekly report 1

On Wed, Apr 18, 2012 at 12:34 AM, Solar Designer <solar@...nwall.com> wrote:
> IIRC, what you tried was not supposed to result in any speedup because
> your GPU code was invoked and required the data to be already available
> right after you started the async copy - so you had it waiting for data
> right at that point anyway.
>
> Lukas' code was different: IIRC, he split the buffered candidate
> passwords in three smaller chunks, where two of the three may in fact be
> transferred to the GPU asynchronously while the previous chunk is being
> processed.  You may implement that too, and I suggest that you make the
> number of chunks to use configurable and try values larger than 3 (e.g.,
> 10 might be reasonable - letting you hide the latency for 9 out of 10
> transfers while hopefully not exceeding the size of a CPU data cache).

I tried this after Lukas posted his code.  If you remember, I have an
ITERATIONS macro in my code: I split max_keys_per_crypt into ITERATIONS
parts.  I have not posted my new results yet.  According to the profiler,
cudaMemcpy still does not overlap with the compute kernel, and there is a
performance regression.

> > 2. Merge cmp_all() with crypt_all()
> > For crypt_all(), we just return.  In cmp_all(), we invoke GPU and
> > return a value indicating if there is a matched hash.
>
> This is going to be problematic.  It will only work well for the special
> case (albeit most common) of exactly one hash per salt.  When there are
> a few more hashes per salt, cmp_all() is called multiple times, so you
> will once again have increased CPU/GPU interaction overhead.  When there
> are many more hashes per salt, cmp_all() is not called at all, but
> instead a get_hash*() function is called.

Yes, I just noticed this.  I took a look at cracker.c: in
crk_password_loop(), we invoke crypt_all() to compute hashes for a batch
of candidate passwords.
Next we invoke cmp_all() for all hashes with the same salt.  But I am
still not sure how to use get_hash*().

> This is why I suggested caching of loaded hashes in previous calls to
> cmp_all(), such that you can move the comparisons into crypt_all()
> starting with the second call to that function.  Then your GPU code for
> crypt_all() will return a flag telling your CPU code for cmp_all() to
> just return that fixed value instead of invoking any GPU code.

This is just like the caching of salts you suggested; I think they are
the same type of question.  With the current interface, cmp_all() is
invoked from fmt_self_test(), benchmark_format(), and the real cracking
code.  The first problem is to distinguish these callers and cache only
the useful loaded hashes.  The second problem is what we actually gain
by merging the GPU code of cmp_all() into crypt_all(): according to the
profiler, cmp_all() takes 1% or less of GPU time (including memcpyHtoD),
so it has no measurable impact on performance.  Does that mean the merge
won't help?

Thanks!
Dongdong Li