Date: Sun, 16 Oct 2005 18:44:09 -0700
From: h1kari <0x31337@...il.com>
To:  john-users@...ts.openwall.com
Subject: Re: Using Hardwareaccelerators to speed up John

Thanks so much for contacting me regarding my work. I really look
forward to working with you guys and discussing this more.

I would like to initially comment that I have more detailed information
on my work on a website I just put up: http://www.openciphers.org

I also have all of my source code published on sourceforge if you guys
want to look at it. Most of it is based off of a modified version of
Rudi Usselman's opencores DES core. Oh, and my newer research has been
in cracking Lanman/NTLM passwords, so my older Unix DES code isn't on
there currently. My comments are inline:

Solar Designer wrote:
>>>1. General-purpose FPGA-based boards.  These would need to be programmed
>>>for the very specific task.  I briefly evaluated this possibility back
>>>in 1998-1999 and it appeared that FPGAs would deliver roughly 5 times
>>>better DES performance for the money, compared against the most suitable
>>>CPUs (at the time, that was Alpha 21164PC - affordable and really good
>>>at bitslice DES).  I used retail prices; the improvement could be a lot
>>>better for large quantities.

I know that some of the statistics in my older presentations were a bit
off. Right now, on our (Pico E-12) LX25 boards, we are able to
clock our design at either 125MHz with one DES core cracking 128 hashes
in parallel or the core instantiated 4 times cracking one hash at 125MHz
* 4 per second. For Unix DES, it would essentially be the Lanman
performance / 25, since Unix DES requires 25 rounds, so the max
performance of our card is currently ~50M c/s, which is a little less
than my projected number in the slides. I'd also like to note that the
clock speed should be able to be increased with additional cooling
and/or higher FPGA speedgrades. Currently it's limited to 125MHz because
the chip goes into thermal runaway if it's clocked higher without
additional cooling (Synthesis says that it should run 200MHz+).
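
As a rough check on the figures above, here is a small Python sketch of
the arithmetic. It assumes (this is not stated in the message) that each
DES core completes one encryption per clock cycle; the function names
are made up for illustration. Taken literally, 125MHz * 4 cores divided
by 25 gives about 20M c/s, which is lower than the ~50M c/s quoted. The
message does not say what accounts for the difference, so read this as
one interpretation of the numbers and not as a correction.

```python
# Back-of-the-envelope arithmetic for the figures quoted in the message.
CLOCK_HZ = 125_000_000   # clock rate stated for the LX25 design
CORES = 4                # "the core instantiated 4 times"
DES_PER_CRYPT = 25       # DES iterations per traditional Unix crypt(3) hash


def lm_keys_per_second(clock_hz=CLOCK_HZ, cores=CORES):
    """LM keys tested per second.

    Assumption (not from the message): every core finishes one DES
    encryption per clock cycle.
    """
    return clock_hz * cores


def unix_des_crypts_per_second(clock_hz=CLOCK_HZ, cores=CORES):
    """Unix DES rate, taken as the LM rate divided by 25."""
    return lm_keys_per_second(clock_hz, cores) // DES_PER_CRYPT


if __name__ == "__main__":
    print(lm_keys_per_second())          # 500000000
    print(unix_des_crypts_per_second())  # 20000000
```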

> Indeed, it has.
> 
> However, my estimate from 6+ years ago ("5 times better DES performance
> for the money") appears to still hold true for low-end FPGAs.
> 
>>Also have a look at the slides:
>>http://www.ccc.de/congress/2004/fahrplan/files/340-fpga-slides.pdf
> 
> On slide 35, "Password File Cracker", the following performance numbers
> are given:
> 
> "PC (3.0Ghz P4 \w john)" - 300,000 c/s
> "Hardware (Low end FPGA \w jawn)" - 4,000,000 c/s
> 
> I am assuming that these are for traditional DES-based crypt(3), which
> is 25 iterations of modified-DES.
> 
> My guess is that the 3 GHz P4 benchmark was done with John 1.6, which
> did not yet use bitslice DES on x86 processors.  Current publicly
> available development versions of John do around 700k c/s on 3 GHz P4s.
> Current non-public development versions of John do around 900k c/s with
> SSE code on the same P4s.  (And as I have already mentioned, PPC G5s are
> even faster than that - up to 1.6M c/s - but they're more expensive.)
> 
> So this gives us a 5x performance increase.  As it relates to prices,
> low-end computers based on P4 Celerons (which are not any slower than
> "full" P4s for John) are likely cheaper than low-end general-purpose
> FPGA-based cards, both in retail quantities.

There are newer slides and source on openciphers. A lot of the
information I provided in my older talks was work-in-progress statistics.

>>David Hulton claimed you'd be able to "crack password hashes as fast as
>>100+ PCs using FPGA PCMCIA cards on your laptop".
> 
> This claim could become the reality, but we're not quite there yet.
> 
> Slide 36, "Up & Coming", gives an estimate of 60M c/s for Picomon, the
> most powerful card (of those listed) that would be usable in a laptop.
> Given that the cards currently available from Pico Computing are priced
> at around $2,500, my guess is that this new card, when released, is not
> going to be cheaper.  (I'm sure it's a lot cheaper to produce, but
> companies such as Pico Computing need to cover their development costs
> and make a profit.)
> 
> If we're comparing this against desktop PCs, a similarly priced one
> would be Apple's PowerMac G5 with dual 2.7 GHz processors.  It can do
> over 3M c/s.  So the FPGA-based card is "only" 20 times faster (which is
> still a lot!), not 100+ times faster.

That is also with Xilinx's lowest end Virtex-4 FPGA. Our newer boards
will feature up to an FX60, which should increase the performance by at
least double, and we'll be able to interface with 2 onboard PowerPC
processors to do software acceleration, which has been one of my main
goals that I haven't been able to implement yet.

>>See http://www.ccc.de/congress/2004/fahrplan/event/244.en.html
>>(IIRC, "basic functionality of john the ripper" merely implemented
>>brute forcing a part of the key space.
> 
> This is a very important observation you make.  It's not only about c/s
> rates, but also about the order in which candidate passwords are tried.
> Much of John's success is due to its ability to try candidate passwords
> in an optimal order.
> 
> With a PCMCIA card capable of hashing candidate passwords at a rate of
> 60 million per second, either the card itself will have to generate the
> candidate passwords to try (in a far less optimal order) or the laptop's
> CPU would become the bottleneck since it wouldn't be able to feed the
> card with candidate passwords at this high a speed.
> 
> A similar problem exists with testing the computed hashes against a
> large number of those loaded for cracking.  Perhaps the hash tables
> (used to quickly locate potentially matching hashes) will have to be
> loaded onto the FPGA card.

Yeah. I'm sorry it ended up coming out comparing directly to the
functionality of John. The idea I was trying to get across was that when
most people think of password cracking, they think of john, and I was
doing something similar. Ideally, I'd like to have the FPGA act as a
hardware accelerator plugin for John and be able to directly enhance the
speed of checking based on intelligent wordlists. Right now our only
niche with this project is for passwords that can't be easily cracked by
John or L0phtcrack.

>>But it should also be possible to let john create the password
>>candidates, and calculate (and compare) the hashes using FPGA
>>hardware, using an order of magnitude larger MAX_KEYS_PER_CRYPT
>>value than for general purpose CPUs.)
> 
> Actually, fully-pipelined implementations of DES are not small, so you
> can only fit a handful of them onto current FPGAs.  If I interpret the
> numbers from David's slides correctly, he has been assuming 1 to 5
> instances of DES per chip.  So MAX_KEYS_PER_CRYPT would need to be
> rather small, unless a larger value is determined to help reduce the
> communication overhead, etc.

Yeah. The 16-stage pipeline method that most people use takes up roughly
20% of the LX25 FPGA, mostly because of the S-Boxes. When you're talking
about Unix DES, you have to feed in 16 passwords and wait 16*26 clock
cycles to get all of the hashes out the other side.
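
To put a number on the pipeline latency described above, here is a
minimal Python sketch. It takes the 16 passwords per batch and the 16*26
clock cycles directly from the paragraph above; the 125MHz clock is the
figure given earlier in this message, and the function name is made up
for illustration. It assumes a new batch is only fed in once the
previous one has fully drained, which may not match the real design.

```python
# Throughput of one 16-stage DES pipeline doing Unix DES.
PIPELINE_STAGES = 16                      # passwords in flight per batch
CYCLES_PER_BATCH = PIPELINE_STAGES * 26   # "16*26 clock cycles" per batch


def crypts_per_second_per_pipeline(clock_hz):
    """Unix DES hashes per second for a single pipeline instance.

    Assumption: batches do not overlap, so every 416 cycles yields
    exactly 16 finished hashes.
    """
    return clock_hz * PIPELINE_STAGES // CYCLES_PER_BATCH


if __name__ == "__main__":
    per_pipeline = crypts_per_second_per_pipeline(125_000_000)
    print(CYCLES_PER_BATCH)   # 416
    print(per_pipeline)       # 4807692, i.e. about 4.8M c/s per pipeline
```

Under these assumptions a single pipeline delivers roughly one hash
every 26 clock cycles, so the per-chip rate scales with how many
pipeline instances fit on the FPGA.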

> David,
> 
> Some more comments on your publications:
> 
> http://www.picocomputing.com/press/KeyRecoveryServer.pdf
> 
> "World's Fastest Lanman/NTLM Key Recovery Server Shipped."
> 
> This press release says that the server can try over 500M LM keys per
> second.  Very impressive indeed.  However, the claims that this is "250x
> faster than a top of the line CPU" and the "12 hours vs. 136 days" once
> again assume unoptimal software.  (I am sure this is unintentional.)
> 
> The current publicly available development version of John can do around
> 7M c/s at LM on modern P4s (2.8+ GHz) and 9.5M c/s at LM on G5 2.7 GHz.
> So your special-purpose server (with 10 FPGA cards) appears to be
> 50 to 70 times faster than individual general-purpose CPUs.  (Curiously
> enough, my "5x speedup" estimate from 6+ years ago still holds true.)
> 
> Also, I believe John's performance at LM hashes could be made 2-3 times
> better if I would re-design it to try candidate passwords in an order
> that is optimal for the low-level routines (essentially eliminating key
> setup overhead).  Currently, John tries more likely passwords first,
> which is highly desirable when using it to detect weak passwords.
> Exhaustive key searches with no requirement to get the weakest passwords
> cracked early on are quite a different task.
> 
> Please don't get me wrong, I find that you're doing the right thing and
> I'd be interested in possible cooperation.  I just couldn't resist the
> temptation to defend my software. ;-)

I'm definitely glad that we're having this discussion. I think that we
ended up testing this against rainbowcrack or l0phtcrack when we did a
speed comparison. You should also note that the server that we built
didn't have optimal cooling, so we ended up running all of the cards at
50MHz to make sure they operated consistently out in the field. Keep in
mind that without the 128 parallel compares on the boards, we have a
mode you can use to run 4 cores in parallel on each card providing 4x
the 500M LM keys performance that we mentioned in the press release.

We're working on a new prototype now that provides cooling for each of
our boards that should allow us to additionally clock all of the boards
at least at 100MHz, which would double the speeds.
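
The server figures above can be tied together with a short Python
sketch. The count of 10 cards comes from the quoted text about the press
release; the one-key-per-clock-per-core model is an assumption that
happens to reproduce the 500M LM keys per second figure, and the
function name is hypothetical.

```python
# Aggregate LM key rate of the multi-card server under a simple model.
CARDS = 10   # "10 FPGA cards", per the quoted text above


def server_lm_keys_per_second(clock_hz, cores_per_card, cards=CARDS):
    """LM keys tested per second across all cards.

    Assumption: each core tests one key per clock cycle. In the 4-core
    mode each key is compared against a single hash instead of 128.
    """
    return cards * cores_per_card * clock_hz


if __name__ == "__main__":
    # As shipped: 50MHz, one core per card with 128 parallel compares.
    print(server_lm_keys_per_second(50_000_000, 1))    # 500000000
    # 4-core mode at the same clock: 4x the press-release figure.
    print(server_lm_keys_per_second(50_000_000, 4))    # 2000000000
    # 4-core mode with the planned cooling at 100MHz: doubled again.
    print(server_lm_keys_per_second(100_000_000, 4))   # 4000000000
```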

Anyway, I definitely didn't mean to knock the work you guys are doing. A
lot of the benchmarking stuff is a little murky and it was hard to find
specific benchmarks from the different open source projects. If you guys
could provide some specs, I would really like to set up a page that
provides accurate performance information from all of the projects
including rainbowcrack/l0phtcrack/etc, or maybe there's a resource
already out there for that.

As for future work, we've been doing a lot of research with the
Virtex-4 FX cards and the onboard PowerPCs and we see a lot of potential
for using the APU bus to provide custom instructions to software (john)
that would allow you to accelerate your DES and other functions with
single instruction calls. I don't know how much this would speed up john
considering the onboard PowerPCs can only be clocked up to 450MHz, but
it seems like it would at least be a bit of a speed improvement over
doing the crypto in software. Your comments on this would be really
appreciated.

Also, if we were able to provide the hardware end of this to you guys,
would you be interested in tying it into john? Also, for the
record, I'm perfectly fine with you guys running my code on any other
FPGA boards, and there are plenty of other ones out there that are a lot
cheaper than the ones that we sell. It would be really cool if FPGA
crypto acceleration started getting more mainstream, so I'd totally
support you guys if you want to port this to other cheaper boards.

Thanks,
-David
