john-dev - Re: Result of hard core password generation on 7970

Message-ID: <CALaL1qAc9GtOv5S=O9KH-egP1O8AEwzX=viX0vZio29V-NPa1Q@mail.gmail.com>
Date: Wed, 25 Jul 2012 10:05:06 -0700
From: Bit Weasil <bitweasil@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: Result of hard core password generation on 7970

I do not know if my previous email made it through.  I got an error from
the mailing list server regarding a disk space error.

Apologies if it goes through twice.

The GPU is not a CPU - you cannot treat it like one!  You cannot safely
treat it as "a bunch of CPUs running in parallel" - this leads to memory
contention.  It must be coded as a very wide vector engine.

The "hard coded" MD5 kernel is written like CPU code, not like GPU code.

I see no use of local memory.  This is very bad.  Global memory is high
bandwidth, but very high latency.

You're doing a linear search through the main global memory to check
passwords, as far as I can tell.  From each thread.  On many AMD GPUs, this
does not broadcast the read (I believe nVidia will, at least on newer GPUs).

Further, there's not a lookup bitmap in sight.  They're used for very good
reasons, and are absolutely critical for good performance on the GPU.  I'm
using a 3 layer bitmap system (local, global-but-cached,
global-and-not-cached) to make sure I only do a binary search through the
sorted hashes if there's a very high probability of finding the hash.  A
walk through global memory space from one thread is incredibly expensive.
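
To make the bitmap idea concrete, here is a minimal host-side sketch in Python.  It is not the Cryptohaze kernel code.  It models only two bitmap layers, with no notion of local versus global memory, and the layer sizes, the tuple-of-four-words hash representation and all function names are assumptions made for illustration.  The point it shows is the ordering: cheap bit tests first, and a binary search through the sorted hashes only when every layer reports a hit.

```python
import bisect

# Hypothetical layer sizes in bits; real sizes depend on the device's
# local memory and cache, which this sketch does not model.
LAYER_BITS = (1 << 13, 1 << 20)

def build_layers(hashes, layer_bits=LAYER_BITS):
    """Build one bitmap per layer, indexed by the low bits of the first hash word."""
    layers = []
    for bits in layer_bits:
        bitmap = bytearray(bits // 8)
        for h in hashes:
            idx = h[0] & (bits - 1)
            bitmap[idx >> 3] |= 1 << (idx & 7)
        layers.append((bits, bitmap))
    return layers

def passes_bitmaps(layers, first_word):
    """Return False as soon as any layer proves the hash is not loaded."""
    for bits, bitmap in layers:
        idx = first_word & (bits - 1)
        if not bitmap[idx >> 3] & (1 << (idx & 7)):
            return False
    return True

def lookup(sorted_hashes, layers, candidate):
    """Binary-search the sorted hash list only when every bitmap layer hits."""
    if not passes_bitmaps(layers, candidate[0]):
        return False
    i = bisect.bisect_left(sorted_hashes, candidate)
    return i < len(sorted_hashes) and sorted_hashes[i] == candidate

# Two made-up target hashes, each stored as four 32-bit words.
targets = sorted([(0x12345678, 1, 2, 3), (0xDEADBEEF, 4, 5, 6)])
layers = build_layers(targets)
```

With these two targets, a candidate whose first word is 0xAAAA5678 hits the small layer (same low 13 bits as 0x12345678) but is rejected by the larger one, so the sorted list is never touched for it.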

There's also just a compare of the first 32-bit value.  A wrong candidate
will match the target hash on that word about once every 2^32 (roughly 4.3B)
values (or, at least with my code, more than once per second).
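
The arithmetic behind that estimate can be sketched as follows.  The function name is hypothetical, and the model assumes hash words are uniformly distributed, so each candidate/hash pair collides on a 32-bit word with probability 1/2^32.  The 5.75B/sec figure is the single-hash stepping rate quoted in this message.

```python
def false_matches_per_second(candidates_per_second, loaded_hashes=1, compared_bits=32):
    """Expected number of partial-compare false matches per second.

    Assumes uniformly distributed hash output, so each candidate/hash pair
    agrees on the compared bits with probability 2**-compared_bits.
    """
    return candidates_per_second * loaded_hashes / 2 ** compared_bits

# Single hash at 5.75B candidates per second, comparing only 32 bits.
rate_32 = false_matches_per_second(5_750_000_000)

# Comparing all 128 bits of an MD5 digest makes false matches negligible.
rate_128 = false_matches_per_second(5_750_000_000, compared_bits=128)
```

At that rate `rate_32` comes out a little above one per second, which is why a 32-bit match has to be followed by a full compare before a password is reported as found.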

I've also seen, in other kernels, what appears to be OpenSSL code, ported
nearly directly, to the GPU.  It's slow on the CPU, and worse on the GPU.

You cannot simply port CPU concepts to the GPU and expect good performance.

I will point to the Cryptohaze OpenCL kernels as reasonably fast OpenCL
kernels.  I am faster than hashcat-plus on 6xxx series GPUs by a good
margin, and slower on 7970s (but only recently obtained a 7970 to develop
with).

They look nothing like traditional CPU code, though they do perform fairly
well on the CPU as an OpenCL target.

You also must have variable work sizes if you wish to run on anything other
than a specific GPU.  Locking up an end user's display is unacceptable (and
will get your kernel killed), and as noted, ASIC hangs are the penalty for
running too long on AMD devices.
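
One way to get variable work sizes is to measure each kernel launch on the host and adjust the next one.  The sketch below is only an illustration of that idea, not how any particular tool does it: the function name, the 50 ms target, the multiple of 64 and the size limits are all assumptions picked for the example.

```python
def next_work_size(current, elapsed_ms, target_ms=50.0, multiple=64,
                   lo=64, hi=1 << 24):
    """Pick the next global work size from the last kernel's run time.

    All defaults are illustrative assumptions, not vendor-recommended values.
    """
    # Scale proportionally toward the target run time.
    scaled = int(current * target_ms / max(elapsed_ms, 1e-3))
    # Change by at most 2x per step so one noisy timing cannot cause a huge jump.
    scaled = max(current // 2, min(current * 2, scaled))
    # Stay inside the allowed range.
    scaled = max(lo, min(hi, scaled))
    # Round down to a multiple of the (assumed) workgroup size.
    return max(multiple, scaled // multiple * multiple)
```

For example, a launch of 4096 work items that took 200 ms would be followed by one of 2048, and a launch that took 10 ms by one of 8192, so the run time converges on the target over a few launches regardless of which GPU is installed.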

I would strongly recommend at least reading through the nVidia CUDA
developer's guide - it's a good overview.  The AMD OpenCL dev guides go
into much greater detail about the AMD GPUs.

"Tweaking" the current hard coded MD5 kernel will not result in very good
performance.

Benchmarking my non-7970-tuned code on a 7970, I get the following speeds:
All length 8, full US charset (95):

Speeds reported are stepping rate (passwords per second), not hashes
checked per second (so not directly comparable to your reported speeds).

1 hash: 5.75B (5 750 000 000) per second
1000 hashes: 5.55B (5 550 000 000) per second (against the full list of
1000 hashes)
1M hashes: 4.00B (4 000 000 000) per second (against the full list of 1M
hashes)

For comparison, my nVidia GTX470 (running the same OpenCL code, which is
not yet optimized for non-vector cards), returns the following:
1 hash: 1.14B (1 140 000 000)
1000 hashes: 1.12B (1 120 000 000)
1M hashes: 970M (970 000 000)
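
Since the two speed conventions in this thread are easy to mix up, here is a small sketch of the conversion.  The function names are hypothetical.  It assumes that the effective "combinations per second" figure is the number of candidates hashed per second multiplied by the number of loaded same-salt hashes, as described in the quoted message below.

```python
def effective_speed(stepping_rate, loaded_hashes):
    """Candidate/hash combinations per second for a given stepping rate."""
    return stepping_rate * loaded_hashes

def stepping_rate(effective, loaded_hashes):
    """Candidates actually hashed per second for a reported effective speed."""
    return effective / loaded_hashes

# "90G" effective against 1000 loaded hashes is 90M candidates per second.
computed = stepping_rate(90_000_000_000, 1000)

# The 7970 figure above for 1000 hashes, expressed as an effective speed.
combos = effective_speed(5_550_000_000, 1000)
```

So the 5.55B stepping rate against 1000 hashes corresponds to 5.55T combinations per second under that convention.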


I'm not sure what hardware the 196M/sec single hash rate was measured on,
but my Intel i7 CPU turns around 180M (180 000 000) MD5 per second single
hash, so for single hash it has just edged out my CPU.

For 1000 hashes, my CPU will sustain roughly 175M (175 000 000) per second.

Just some data points.

On Wed, Jul 25, 2012 at 5:30 AM, Solar Designer <solar@...nwall.com> wrote:

> myrice -
>
> On Wed, Jul 25, 2012 at 06:12:00PM +0800, myrice wrote:
> > Here is the new rough result
> >
> > 1: with 2048*8, ~900M c/s, but with 2048*16, it is ~500M c/s
>
> OK, this is starting to become reasonable.  Have you also tried values
> smaller than 2048*8?
>
> > 1000: with 2048*8, ~45G = ~45M, with 2048*16, ~90G = ~ 45M
>
> I understand what you mean by "~45G = ~45M", but why "~90G = ~ 45M"?
> I think you're confused.  At 1000 same-salt hashes, "90G" reported
> effective speed (combinations per second) means 90M hashes computed per
> second.  max_keys_per_crypt is not part of that equation.
>
> (I definitely need to improve speed reporting to avoid such confusion.)
>
> > 1M: still cannot get, I reduce the global work size to 128 and only
> > append[a-e][a-e], it is very slow. And the kernel cannot finished with
> > [a-z][a-z].
> >
> > I am think about compare, with 1M hashes, the loop inside one thread
> > increased to 26*26*1M = 676M, it is very large for a thread.
>
> As discussed, you absolutely must implement bitmaps and hash tables on
> GPU.  Your direct comparisons are only good for very small numbers of
> loaded hashes and for early experiments with larger numbers, like what
> you're doing now.  As you've reached this milestone, you should now
> proceed further - to bitmaps and hash tables.
>
> Thanks,
>
> Alexander
>
