Date: Wed, 25 Jul 2012 10:05:06 -0700
From: Bit Weasil <bitweasil@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: Result of hard core password generation on 7970

I do not know if my previous email made it through. I got an error from the mailing list server regarding a disk space error. Apologies if it goes through twice.

The GPU is not a CPU - you cannot treat it like one! You cannot safely treat it as "a bunch of CPUs running in parallel" - this leads to memory contention. It must be coded as a very wide vector engine.

The "hard coded" MD5 kernel is written like CPU code, not like GPU code.

I see no use of local memory. This is very bad. Global memory is high bandwidth, but very high latency. You're doing a linear search through the main global memory to check passwords, as far as I can tell. From each thread. On many AMD GPUs, this does not broadcast the read (I believe nVidia will, at least on newer GPUs).

Further, there's not a lookup bitmap in sight. They're used for very good reasons, and are absolutely critical for good performance on the GPU. I'm using a 3 layer bitmap system (local, global-but-cached, global-and-not-cached) to make sure I only do a binary search through the sorted hashes if there's a very high probability of finding the hash. A walk through global memory space from one thread is incredibly expensive.

There's also just a compare of the first 32-bit value. This will match the target hash once every 4B values (or, at least with my code, more than once per second).

I've also seen, in other kernels, what appears to be OpenSSL code, ported nearly directly, to the GPU. It's slow on the CPU, and worse on the GPU. You cannot simply port CPU concepts to the GPU and expect good performance.

I will point to the Cryptohaze OpenCL kernels as reasonably fast OpenCL kernels. I am faster than hashcat-plus on 6xxx series GPUs by a good margin, and slower on 7970s (but only recently obtained a 7970 to develop with).
They look nothing like traditional CPU code, though they do perform fairly well on the CPU as an OpenCL target.

You also must have variable work sizes if you wish to run on anything other than a specific GPU. Locking up an end user's display is unacceptable (and will get your kernel killed), and as noted, ASIC hangs are the penalty for running too long on AMD devices.

I would strongly recommend at least reading through the nVidia CUDA developer's guide - it's a good overview. The AMD OpenCL dev guides go into much greater detail about the AMD GPUs.

"Tweaking" the current hard coded MD5 kernel will not result in very good performance.

Benchmarking my non-7970-tuned code on a 7970, I get the following speeds:

All length 8, full US charset (95):
Speeds reported are stepping rate (passwords per second), not hashes checked per second (so not directly comparable to your reported speeds).

1 hash: 5.75B (5 750 000 000) per second
1000 hashes: 5.55B (5 550 000 000) per second (against the full list of 1000 hashes)
1M hashes: 4.00B (4 000 000 000) per second (against the full list of 1M hashes)

For comparison, my nVidia GTX470 (running the same OpenCL code, which is not yet optimized for non-vector cards), returns the following:

1 hash: 1.14B (1 140 000 000)
1000 hashes: 1.12B (1 120 000 000)
1M hashes: 970M (970 000 000)

I'm not sure what the 196M/sec single hash rate was on, but my Intel i7 CPU turns around 180M (180 000 000) MD5 per second single hash, so for single hash it has just edged out my CPU. For 1000 hashes, my CPU will sustain roughly 175M (175 000 000) per second.

Just some data points.

On Wed, Jul 25, 2012 at 5:30 AM, Solar Designer <solar@...nwall.com> wrote:

> myrice -
>
> On Wed, Jul 25, 2012 at 06:12:00PM +0800, myrice wrote:
> > Here is the new rough result
> >
> > 1: with 2048*8, ~900M c/s, but with 2048*16, it is ~500M c/s
>
> OK, this is starting to become reasonable. Have you also tried values
> smaller than 2048*8?
>
> > 1000: with 2048*8, ~45G = ~45M, with 2048*16, ~90G = ~ 45M
>
> I understand what you mean by "~45G = ~45M", but why "~90G = ~ 45M"?
> I think you're confused. At 1000 same-salt hashes, "90G" reported
> effective speed (combinations per second) means 90M hashes computed per
> second. max_keys_per_crypt is not part of that equation.
>
> (I definitely need to improve speed reporting to avoid such confusion.)
>
> > 1M: still cannot get, I reduce the global work size to 128 and only
> > append[a-e][a-e], it is very slow. And the kernel cannot finished with
> > [a-z][a-z].
> >
> > I am think about compare, with 1M hashes, the loop inside one thread
> > increased to 26*26*1M = 676M, it is very large for a thread.
>
> As discussed, you absolutely must implement bitmaps and hash tables on
> GPU. Your direct comparisons are only good for very small numbers of
> loaded hashes and for early experiments with larger numbers, like what
> you're doing now. As you've reached this milestone, you should now
> proceed further - to bitmaps and hash tables.
>
> Thanks,
>
> Alexander
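
Both messages above recommend a bitmap check in front of the full hash comparison. The following is a minimal CPU-side sketch of that idea in Python, not GPU code and not the Cryptohaze or John the Ripper implementation. It uses a single bitmap level rather than the 3 layer system described above, and every name in it (BITMAP_BITS, first_word, build_bitmap, maybe_present, is_loaded) is hypothetical, as are the bitmap size and the sample passwords.

```python
import bisect
import hashlib

# Hypothetical bitmap size: 65,536 bits (8 KiB), chosen only for illustration.
BITMAP_BITS = 1 << 16


def first_word(digest: bytes) -> int:
    """Return the first 32 bits of a digest as a little-endian integer."""
    return int.from_bytes(digest[:4], "little")


def build_bitmap(digests) -> bytearray:
    """Set one bit per loaded hash, indexed by the low bits of its first word."""
    bitmap = bytearray(BITMAP_BITS // 8)
    for d in digests:
        idx = first_word(d) % BITMAP_BITS
        bitmap[idx >> 3] |= 1 << (idx & 7)
    return bitmap


def maybe_present(bitmap: bytearray, digest: bytes) -> bool:
    """Cheap filter: False means definitely not loaded, True means 'go look'."""
    idx = first_word(digest) % BITMAP_BITS
    return bool(bitmap[idx >> 3] & (1 << (idx & 7)))


def is_loaded(bitmap: bytearray, sorted_digests: list, digest: bytes) -> bool:
    """Binary-search the sorted hash list only when the bitmap says it may hit."""
    if not maybe_present(bitmap, digest):
        return False
    pos = bisect.bisect_left(sorted_digests, digest)
    # Compare the full digest, not just its first 32 bits.
    return pos < len(sorted_digests) and sorted_digests[pos] == digest


# Sample data (assumed): three MD5 digests standing in for the loaded hashes.
targets = sorted(hashlib.md5(w).digest() for w in (b"password", b"letmein", b"hunter2"))
bitmap = build_bitmap(targets)

print(is_loaded(bitmap, targets, hashlib.md5(b"letmein").digest()))
print(is_loaded(bitmap, targets, hashlib.md5(b"not-loaded").digest()))
```

With 1000 loaded hashes and this bitmap size, at most 1000 of the 65,536 bits can be set, so roughly 1.5% or fewer of non-matching candidates would get past the filter and reach the binary search, assuming the hash output is uniformly distributed. The final comparison is on the whole digest, which avoids the false matches that a compare of only the first 32-bit value would produce.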