john-users - Re: CUDA the Ripper

Message-ID: <20080602043707.GA21339@openwall.com>
Date: Mon, 2 Jun 2008 08:37:07 +0400
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Subject: Re: CUDA the Ripper

On Mon, Jun 02, 2008 at 12:32:45AM +0400, Alex V. Breger wrote:
> Are there any attempts to use GPU computing for John the Ripper?

I was not aware of such attempts (specific to JtR) until you mentioned yours.

> There was some problems with bench.c and incremental cracker.
> Benchmark can't get a full speed, which measured by real hash
> cracking after a some time.

I'm not sure what you mean here.  Are you saying that "--test" does not
report "full speed" as measured by something else (by what?) or that
"incremental mode" does not achieve "full speed" as reported by "--test"?

> How does john calculate a speed of hash generation?

It's somewhat different for "--test" and actual cracking, although with
"--test" JtR does try to simulate real-world scenarios (including for one
vs. many salts, when applicable).  I may be able to give a more specific
answer if you make your question more specific. ;-)

> I've noticed some inertness - speed is slowly growing with time.

You're probably talking about "incremental mode" here.  If so, this is
addressed in the FAQ:

Q: I just noticed that the c/s rate reported while using "incremental"
mode is a lot lower than it is with other cracking modes.  Why?
A: You're probably running John for a few seconds only.  The current
"incremental" mode implementation uses large character sets which need
to be expanded into even larger data structures in memory each time John
switches to a different password length.  Fortunately, this is only
noticeable when John has just started since the length switches become
rare after a few minutes.  For long-living sessions, which is where we
care about performance the most, this overhead is negligible.  This is a
very low price for the better order of candidate passwords tried.

For benchmarking, you may create a new "incremental mode" section, where
you would set MinLen and MaxLen to the same value.  Then there would be
no length switches, although some startup overhead would remain -
expanding the tables to higher character counts.
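
To illustrate why the reported rate creeps up over time, here is a
minimal sketch in C.  It assumes the reported c/s is a cumulative
average since the start of the session, and it lumps all table-expansion
overhead into one interval at startup, which is a simplification.  The
function name and the model itself are hypothetical, not taken from JtR.

```c
#include <assert.h>

/* Simplified model (an assumption, not JtR code): the session spends
 * overhead_s seconds expanding tables and producing no candidates, then
 * runs at steady_cps.  Returns the cumulative average c/s observed at
 * elapsed_s seconds into the session. */
static double average_cps(double steady_cps, double overhead_s,
	double elapsed_s)
{
	assert(elapsed_s > 0.0 && overhead_s >= 0.0);
	if (elapsed_s <= overhead_s)
		return 0.0;	/* still expanding tables */
	return steady_cps * (elapsed_s - overhead_s) / elapsed_s;
}
```

With made-up numbers (45M c/s steady rate, 5 seconds of overhead), this
model gives about 22.5M c/s after 10 seconds and about 41M c/s after
60 seconds, approaching the steady rate as the session gets longer.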

> How fast is incremental cracker? What a maximum rate of password
> generation can it get?

It's very fast, but possibly not as fast as you would like it to be for
extremely fast saltless hashes.

With this in john.conf:

[Incremental:All8]
File = $JOHN/all.chr
MinLen = 8
MaxLen = 8
CharCount = 95

and the puts() call in cracker.c commented out, I am getting around
45M c/s after 1 minute of running with "-i=all8 --stdout" on Athlon64
3000+ 2.0 GHz, linux-x86-64 build, gcc 3.4.5.  Further speedup is
possible, for example, by dropping the external filter() support:

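#if 0
	/* original: apply the external filter(), if any, then try the key */
	key = key_i;
	if (!ext_mode || !f_filter || ext_filter_body(key_i, key = key_e))
	if (crk_process_key(key)) return 1;
#else
	/* hack: pass the candidate straight through, no filter() support */
	if (crk_process_key(key_i)) return 1;
#endif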

This achieves 46M+ c/s with the same test.  There's some other overhead
that can be dropped or avoided as well, such as function calls across
source files, which the compiler cannot inline.

Also, my hacked "--stdout" has higher overhead than a multi-key hash
implementation would have.  Specifically, with "--stdout"
status_update_crypts() and crk_fix_state() are called for each key,
whereas with a multi-key hash implementation they would be called once
per whatever number of keys is processed in one call to crypt_all().
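
The difference in call counts can be sketched as follows.  This is a
toy model in C, not JtR code: bookkeeping() is a hypothetical stand-in
for the per-call work done by status_update_crypts() and
crk_fix_state(), and process_keys() is a made-up driver loop.

```c
#include <assert.h>

/* Hypothetical stand-in for per-call bookkeeping such as
 * status_update_crypts() and crk_fix_state(); it only counts calls. */
static unsigned long bookkeeping_calls;

static void bookkeeping(void)
{
	bookkeeping_calls++;
}

/* Process total_keys candidates, doing the bookkeeping once per batch
 * of up to keys_per_call keys.  keys_per_call == 1 models the hacked
 * "--stdout" path; larger values model a multi-key crypt_all().
 * Returns the number of bookkeeping calls made. */
static unsigned long process_keys(unsigned long total_keys,
	unsigned long keys_per_call)
{
	unsigned long done = 0;

	assert(keys_per_call > 0);
	bookkeeping_calls = 0;
	while (done < total_keys) {
		unsigned long n = total_keys - done;

		if (n > keys_per_call)
			n = keys_per_call;
		/* a real format would hash these n keys in one call here */
		done += n;
		bookkeeping();
	}
	return bookkeeping_calls;
}
```

For 1000 keys, process_keys(1000, 1) makes 1000 bookkeeping calls,
whereas process_keys(1000, 64) makes 16.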

> For CUDA I use a big sets of password (from tens of hundreds to
> several millions) to transfer
> to GPU for processing.
> I think, that bottleneck for now is incremental cracker or my
> _set_key() function. Transferring data to GPU also can be a bottleneck.

This sounds reasonable.  Yes, for extremely fast implementations of fast
saltless hashes, you may have to make some low-level optimizations to
the normally high-level code in JtR, or maybe implement some of it right
on the GPU.  You may also make use of multiple CPU cores, running
separate instances of the "incremental cracker" on them (e.g., skipping
over order[] entries that are to be processed by other CPU cores) - if
one core can do 50M c/s, then you could get close to 200M c/s on a
quad-core, assuming near-linear scaling - which may be enough to make
full use of the GPU.
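
One possible way to split order[] entries between instances is a simple
round-robin on the entry number, sketched below in C.  The helper names
and the round-robin policy are assumptions made for illustration and
are not part of JtR.  Entries are unlikely to be equal in size, so this
does not guarantee an even load across cores.

```c
#include <assert.h>

/* Hypothetical helper, not part of JtR: nonzero if instance `worker`
 * (0-based) out of `workers` should process order[] entry `entry`. */
static int entry_is_mine(unsigned int entry, unsigned int worker,
	unsigned int workers)
{
	assert(workers > 0 && worker < workers);
	return entry % workers == worker;
}

/* Count how many of the first `entries` entries a given instance
 * would process; every entry is claimed by exactly one instance. */
static unsigned int entries_for_worker(unsigned int entries,
	unsigned int worker, unsigned int workers)
{
	unsigned int entry, mine = 0;

	for (entry = 0; entry < entries; entry++)
		if (entry_is_mine(entry, worker, workers))
			mine++;
	return mine;
}
```

Each instance would then skip every entry for which entry_is_mine()
returns zero, and the per-instance counts add up to the total number of
entries.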

I would probably concentrate on slower and/or salted hashes, though.
Implement those on the GPU.  That's where JtR's ability to generate
candidate passwords in an "intelligent" way and the GPU's processing
power are most helpful.

Thanks,

Alexander
