john-users - Re: Using Hardwareaccelerators to speed up John

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <20051017213736.GA8488@openwall.com>
Date: Tue, 18 Oct 2005 01:37:36 +0400
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Subject: Re: Using Hardwareaccelerators to speed up John

On Mon, Oct 17, 2005 at 10:17:49AM -0700, h1kari wrote:
> Solar Designer wrote:
> > If so, starting with a few thousand LM hashes to crack, John running
> > on a single modern CPU would outperform your implementation on a single
> > Pico E-12 LO.  Is that correct?
> 
> Yeah, I would think so. If you're comparing to a few thousand hashes
> after each computation of crypt()

Actually, for bitslice DES, John obviously only does comparisons once
for every N hashes computed, with N being the machine's word size (or
vector size, e.g. 128 for AltiVec).  For small numbers of hashes loaded
(for a given salt, if applicable), the following algorithm is used:

/*
 * The trick I used here allows to compare one ciphertext against all the
 * DES_bs_crypt() outputs in just O(log2(ARCH_BITS)) operations, assuming
 * that DES_BS_VECTOR is 0 or 1. This routine isn't vectorized, yet.
 */
int DES_bs_cmp_all(ARCH_WORD *binary)
[...]

(that's in DES_bs.c).  For AltiVec on 32-bit PowerPC, it would require
4*5 = 20 operations (on average) to check a loaded password hash against
128 computed ones.  For a plain 64-bit architecture such as Alpha, it
would require only 6 operations for the 64 "comparisons".

For larger numbers of hashes loaded (per salt, if applicable), hash
tables are used instead.

> would it slow you down a little bit
> compared to your performance info you gave earlier?

"A little bit" would be the answer.  I can run some benchmarks if you're
interested in exact numbers.  I previously did such measurements for
large numbers of crypt(3) hashes per salt, up to several thousand.  The
slowdown was barely measurable.  Of course, LM hashes are a lot quicker
and would be impacted more.

> > so are you suggesting to run an embedded Linux and John on one or both
> > of the embedded PowerPCs?
> 
> Yeah, I was suggesting that mainly because of the latency and overhead
> involved with talking to a host processor. Right now if we're generating
> 20M c/s, that's roughly 1.2Gbps fof passwords going to the card, and
> another 1.28Gbps coming back.

Yes, and that's a bit excessive.  However, we might be able to halve
those numbers by using a trivial compression (or rather, encoding)
algorithm on the candidate passwords, such as DAWG as implemented for
wordlists in Crack 5 (and have the embedded PowerPC undo that?), and we
don't really need full 64 bits of hashes on the way back.  32 bits (or
even less) would be sufficient to rule out most candidate passwords and
the rest could be tested on the host's CPU.  Then it would fit in PCI or
in Gigabit Ethernet.

> When we're talking about Lanman/NTLM it's 25x that.

This approach won't work well for non-iterated hashes.  The speedup from
it would be minimal, -- like getting the same 20M c/s for LM that we
would for crypt(3).  For those hashes, we have to let the FPGA at least
produce batches of candidate passwords, maybe implementing an equivalent
of John the Ripper "incremental" mode's inner loop.  The outer loops
could be left on the host system.  The stream of computed hashes would
remain a problem, though.

> The other thing is we could use the APU bus for
> specific operations that could be sped up inside of some of the
> algorithms, similar to your SSE or Altivec optimizations, but more
> geared towards crypto.

Yes.  Actually, we might be able to implement sort of a vector
computer, increasing vector size way beyond AltiVec's 128 bits.
Then the existing framework in John would work for it.

> Our higher end cards do have 256MB of DDR2 ram
> that runs up to 600mhz I believe, and 64MB of flash. We do have plans
> for building one in the near future with 2GB+ of user flash, which may
> be useful for wordlist storage and such.

Yes, this definitely could run John in its entirety, but it could be
somewhat tricky to manage for someone not familiar with the system.

> > Perhaps we could start by having the FPGA card do the bare minimum and
> > having John running on the host system communicate candidate passwords
> > and salts to the card and computed hashes back.  It won't be very fast,
> > but it's likely the quickest way to get us started.  We can do it for
> > just one hash type initially (perhaps one of the Unix hashes since we
> > need it to not be very fast).
> 
> Yeah, that sounds like a good starting point which would be fairly
> trivial right now. I could get something set up for you and send you a
> card + driver in the next couple of weeks if you're really interested.
> We're still working on getting linux running fully on our boards, but I
> expect APU john optimizations would be trivial once that's completed in
> the next couple of months.

Yes, I am interested.  Finding the time for this is, however, not going
to be easy.  I know I won't be able to approach this any earlier than
December.  Let's discuss this off-list.

Thanks,

-- 
Alexander Peslyak <solar at openwall.com>
GPG key ID: B35D3598  fp: 6429 0D7E F130 C13E C929  6447 73C3 A290 B35D 3598
http://www.openwall.com - bringing security into open computing environments
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.