Date: Mon, 2 Jun 2008 08:37:07 +0400
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Subject: Re: CUDA the Ripper

On Mon, Jun 02, 2008 at 12:32:45AM +0400, Alex V. Breger wrote:
> Are there any attempts to use GPU computing for John the Ripper?

I was not aware of such attempts (specific to JtR) until you mentioned
yours.

> There was some problems with bench.c and incremental cracker.
> Benchmark can't get a full speed, which measured by real hash
> cracking after a some time.

I'm not sure what you mean here. Are you saying that "--test" does not
report "full speed" as measured by something else (by what?) or that
"incremental mode" does not achieve "full speed" as reported by
"--test"?

> How does john calculate a speed of hash generation?

It's somewhat different for "--test" and actual cracking, although with
"--test" JtR does try to simulate real-world scenarios (including for
one vs. many salts, when applicable). I may be able to give a more
specific answer if you make your question more specific. ;-)

> I've noticed some inertness - speed is slowly growing with time.

You're probably talking of "incremental mode" here. If so, this is
addressed in the FAQ:

Q: I just noticed that the c/s rate reported while using "incremental"
mode is a lot lower than it is with other cracking modes. Why?
A: You're probably running John for a few seconds only. The current
"incremental" mode implementation uses large character sets which need
to be expanded into even larger data structures in memory each time John
switches to a different password length. Fortunately, this is only
noticeable when John has just started since the length switches become
rare after a few minutes. For long-living sessions, which is where we
care about performance the most, this overhead is negligible. This is a
very low price for the better order of candidate passwords tried.
For benchmarking, you may create a new "incremental mode" section, where
you would set MinLen and MaxLen to the same value. Then there would be
no length switches, although some startup overhead would remain -
expanding the tables to higher character counts.

> How fast is incremental cracker? What a maximum rate of password
> generation can it get?

It's very fast, but possibly not as fast as you would like it to be for
extremely fast saltless hashes. With this in john.conf:

[Incremental:All8]
File = $JOHN/all.chr
MinLen = 8
MaxLen = 8
CharCount = 95

and the puts() call in cracker.c commented out, I am getting around 45M
c/s after 1 minute of running with "-i=all8 --stdout" on Athlon64 3000+
2.0 GHz, linux-x86-64 build, gcc 3.4.5.

Further speedup is possible, for example, by dropping the external
filter() support:

#if 0
		key = key_i;
		if (!ext_mode || !f_filter || ext_filter_body(key_i, key = key_e))
		if (crk_process_key(key)) return 1;
#else
		if (crk_process_key(key_i)) return 1;
#endif

This achieves 46M+ c/s with the same test.

There's some other overhead that can be dropped or avoided as well, such
as function calls between different source files, which prevents
function inlining. Also, my hacked "--stdout" has higher overhead than
a multi-key hash implementation would have. Specifically, with
"--stdout" status_update_crypts() and crk_fix_state() are called for
each key, whereas with a multi-key hash implementation they would be
called once per whatever number of keys is processed in one call to
crypt_all().

> For CUDA I use a big sets of password (from tens of hundreds to
> several millions) to transfer
> to GPU for processing.
> I think, that bottleneck for now is incremental cracker or my
> _set_key() function. Transferring data to GPU also can be a bottleneck.

This sounds reasonable.
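The per-key versus per-batch overhead mentioned above can be illustrated
with a minimal sketch. This is not JtR code: struct batch, batch_init(),
batch_add() and batch_flush() are hypothetical names, and the flush
counter merely stands in for the bookkeeping that would run once per
crypt_all() call. The sizes BATCH_MAX and KEY_MAX are arbitrary
assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BATCH_MAX 64	/* hypothetical upper bound on keys per batch */
#define KEY_MAX   16	/* hypothetical max key length, including NUL */

struct batch {
	char keys[BATCH_MAX][KEY_MAX];	/* buffered candidate passwords */
	size_t size;			/* keys per batch, 1..BATCH_MAX */
	size_t count;			/* keys currently buffered */
	unsigned long flushes;		/* times per-batch bookkeeping ran */
};

void batch_init(struct batch *b, size_t size)
{
	assert(size >= 1 && size <= BATCH_MAX);
	memset(b, 0, sizeof(*b));
	b->size = size;
}

/*
 * Stand-in for hashing all buffered keys in one call and then doing the
 * bookkeeping once for the whole batch.  Here we only count the calls.
 */
void batch_flush(struct batch *b)
{
	if (b->count == 0)
		return;
	b->flushes++;
	b->count = 0;
}

/* Buffer one key; flush automatically when the batch is full. */
void batch_add(struct batch *b, const char *key)
{
	strncpy(b->keys[b->count], key, KEY_MAX - 1);
	b->keys[b->count][KEY_MAX - 1] = '\0';
	if (++b->count == b->size)
		batch_flush(b);
}
```

With a batch size of 1 the bookkeeping runs once per key, which is the
situation described for the hacked "--stdout". With a larger batch size
it runs once per batch, plus one final batch_flush() for any leftover
keys.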
Yes, for extremely fast implementations of fast saltless hashes, you may
have to make some low-level optimizations to the normally high-level
code in JtR, or maybe implement some of it right on the GPU. You may
also make use of multiple CPU cores, running separate instances of the
"incremental cracker" on them (e.g., skipping over order entries that
are to be processed by other CPU cores) - if one core can do 50M c/s,
then you get 200M c/s on a quad-core - which may be enough to make full
use of the GPU.

I would probably concentrate on slower and/or salted hashes, though.
Implement those on the GPU. That's where JtR's ability to generate
candidate passwords in an "intelligent" way and the GPU's processing
power are most helpful.

Thanks,

Alexander
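The multi-core idea above, where each instance skips order entries that
belong to other cores, can be sketched as follows. This is an
illustration only, not JtR code: entry_is_mine(), count_my_entries() and
ideal_total_rate() are hypothetical names, and the round-robin
assignment by entry index is an assumption, since the message does not
say how entries would be divided. The 4 x 50M = 200M figure is an ideal
that ignores any coordination overhead.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: returns 1 if this worker should process the
 * given order entry, 0 if another worker owns it.  Entries are assumed
 * to be dealt out round-robin by index.
 */
int entry_is_mine(uint64_t entry, unsigned worker, unsigned workers)
{
	assert(workers > 0 && worker < workers);
	return entry % workers == worker;
}

/* Count how many of the entries 0..total-1 fall to one worker. */
uint64_t count_my_entries(uint64_t total, unsigned worker, unsigned workers)
{
	uint64_t n = 0, e;

	for (e = 0; e < total; e++)
		if (entry_is_mine(e, worker, workers))
			n++;
	return n;
}

/* Ideal aggregate rate, assuming perfect scaling across cores. */
uint64_t ideal_total_rate(uint64_t per_core_rate, unsigned cores)
{
	return per_core_rate * cores;
}
```

With 4 workers, worker 1 would take entries 1, 5, 9 and so on, and no
entry is processed twice. Calling ideal_total_rate(50000000, 4) gives
the 200M c/s quad-core estimate from the message.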