Date: Sun, 4 Jul 2010 00:47:59 +0400
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Subject: Re: bitslice DES parallelization with OpenMP

Hi,

How do you like 356M c/s at LM hashes on a single system (no GPU, seriously), and 15M c/s at traditional DES-based crypt(3) with just a single hash being attacked (the slower case)? And that's with no hand-crafted assembly code at all. Please read on. ;-)

This time, I came up with and implemented a new bitslice DES key setup algorithm (currently with SSE2 intrinsics only - a proof of concept). Although the new algorithm was meant to allow for vectorization and parallelization - and it does - it is also a lot faster even in a single-process, single-thread build (no OpenMP involved), albeit with SSE2 vectorization where appropriate.

Here are the new single-process, single-thread benchmarks on Core i7 920 2.67 GHz:

Benchmarking: Traditional DES [128/128 BS SSE2-16]... DONE
Many salts:     2573K c/s real, 2573K c/s virtual
Only one salt:  2484K c/s real, 2484K c/s virtual

At first glance, this one is not exciting, but notice how the gap between multi-salt and single-salt speed became much narrower. This will be important for subsequent parallelization (see below).

Benchmarking: LM DES [128/128 BS SSE2-16]... DONE
Raw:    38781K c/s real, 38781K c/s virtual

And this one is exciting: it is more than twice as fast as what a clean 1.7.6 achieves on this machine (around 17.5M c/s at LM). That's because LM hash computation is mostly about key setup (or at least it was prior to this change).

Another exciting result: with 8 separate processes running on the Core i7 at once, I am getting a combined rate of 173M c/s at LM hashes. For dual Xeon X5460 3.16 GHz, the numbers are 45M c/s (non-OpenMP build, single process, single thread) and as much as 356M c/s combined for 8 simultaneous processes (still no OpenMP).
This is with john-1.7.6-fast-des-key-setup-3.diff.gz, which I uploaded to the wiki:

http://openwall.info/wiki/john/patches

I am also uploading these patches to:

http://download.openwall.net/pub/projects/john/contrib/

so they get propagated to the mirrors.

Now let's get back to OpenMP. 1.7.6-omp-des-7, also uploaded to the wiki, integrates the DES key setup code above and parallelizes it. For DES-based crypt(3), things are quite good. On the Core i7:

$ GOMP_SPINCOUNT=10000 ../run/john -te -fo=des
Benchmarking: Traditional DES [128/128 BS SSE2-16]... DONE
Many salts:     9584K c/s real, 1202K c/s virtual
Only one salt:  8565K c/s real, 1165K c/s virtual

During the first part of the benchmark (multi-salt), system load stays at over 99%. During the second part, it drops to 92% (leaving 8% idle), which is quite normal given that some code that runs per-key hasn't been parallelized (doing so wouldn't fit well in the OpenMP model). With 8 separate non-OpenMP-enabled processes, the same system does around 10800K c/s combined for the single-salt case, so the efficiency of the OpenMP-enabled build at this test is around 80%.

Unfortunately, the speed for multi-salt has dropped from around 10200K for -omp-des-4 to around 9600K. I previously traced this to the move of the "expand key" step from DES_bs.c into the "crypt bodies" in DES_bs_b.c, but why this causes a slowdown is a mystery to me. I haven't investigated this further yet - if anyone does, please let me know. Meanwhile, I am keeping the older -omp-des-4 patch on the wiki as well. I had to make this "expand key" move in this newer patch in order to move key setup inside the parallel region, which was required to achieve decent performance for single-salt.
Let's see if it's achieved in practice:

$ GOMP_SPINCOUNT=10000 ./john -i=alpha pw
Loaded 1 password hash (Traditional DES [128/128 BS SSE2-16])
guesses: 0  time: 0:00:00:01  c/s: 2850K  trying: lkuku - botink
guesses: 0  time: 0:00:00:05  c/s: 6596K  trying: droopapr - stpgehed
guesses: 0  time: 0:00:00:14  c/s: 7464K  trying: fedsuyog - fesidanc
guesses: 0  time: 0:00:00:28  c/s: 7788K  trying: vwhwca - vwcvfr
guesses: 0  time: 0:00:00:53  c/s: 7965K  trying: dcerkncs - alecismc
guesses: 0  time: 0:00:01:09  c/s: 8002K  trying: sejaxn - siqtui
guesses: 0  time: 0:00:01:45  c/s: 8071K  trying: eicomdm - eicvdic
guesses: 0  time: 0:00:02:21  c/s: 8103K  trying: huaugost - huluigra
apmahoop         (user)
guesses: 1  time: 0:00:02:41  c/s: 8110K  trying: apffjkbt - acchucrm

Yes, it is. (It is normal for "incremental" mode, when not locked to a specific length, to take a while to get up to full speed. This is because length switches involve overhead, but those become very infrequent when running for a while. The speed would continue to increase even further if the above session were left running for longer.)

Now to LM hash parallelization. In short, there's some speedup from parallelization, but things are not great - and my recommendation is to run separate processes. Here's what I am getting on the Core i7 with 1.7.6-omp-des-7 running 8 threads:

$ GOMP_SPINCOUNT=10000 ../run/john -te=1 -fo=lm
Benchmarking: LM DES [128/128 BS SSE2-16]... DONE
Raw:    65126K c/s real, 16120K c/s virtual

That's about the best I can achieve with the current code. It leaves about 50% of CPU time idle (although due to SMT this corresponds to less than 50% of actual CPU resources staying idle).
By running two 4-thread processes simultaneously, I achieve a combined rate of 90M c/s:

Raw:    45386K c/s real, 17443K c/s virtual
Raw:    45386K c/s real, 17429K c/s virtual
-  Done  GOMP_SPINCOUNT=10000 OMP_NUM_THREADS=4 ./john -te -fo=lm
+  Done  GOMP_SPINCOUNT=10000 OMP_NUM_THREADS=4 ./john -te -fo=lm

That's better, but it's still very far from the combined rate for 8 separate non-OpenMP-enabled processes, which is as high as 173M c/s.

Now let's see how this behaves in actual usage. I'll use just one hash for this test for clarity, although multi-hash benchmarks are also very interesting and are of more practical relevance. Those show similar results ("trust me" - or better yet, try on your own and post in here).

First, just the new key setup (1.7.6-nsk-3), no OpenMP yet, back on the Core i7 machine:

$ ./john -i -u=u3047 ~/john/pw-fake-nt
Loaded 1 password hash (LM DES [128/128 BS SSE2-16])
guesses: 0  time: 0:00:00:02  c/s: 19644K  trying: 09ACEBU - 09ACOS3
guesses: 0  time: 0:00:00:08  c/s: 26005K  trying: BBOSNCR - BBOTPHA
guesses: 0  time: 0:00:00:16  c/s: 28044K  trying: PW0DSK - PW0D72
guesses: 0  time: 0:00:00:20  c/s: 28688K  trying: COHJF9 - COVDFT
guesses: 0  time: 0:00:00:27  c/s: 29225K  trying: 34JYDW - 34JY31
guesses: 0  time: 0:00:00:43  c/s: 30049K  trying: KS54WG - KS58TL
guesses: 0  time: 0:00:01:06  c/s: 30444K  trying: 408I1$O - 407929T
NE1410S          (u3047)
guesses: 1  time: 0:00:01:11  c/s: 30387K  trying: NECQ9SU - NE1411B

This achieved over 30M c/s out of the "theoretical maximum" of 39M c/s. With a longer session (or with a length lock), the speed would be slightly higher. Note that only one CPU core was in use by John in the above run.
Now 1.7.6-omp-des-7 running 8 threads, same machine:

$ GOMP_SPINCOUNT=10000 ./john -i -u=u3047 ~/john/pw-fake-nt
Loaded 1 password hash (LM DES [128/128 BS SSE2-16])
guesses: 0  time: 0:00:00:00  c/s: 6156K  trying: PY2TY - M02TS
guesses: 0  time: 0:00:00:04  c/s: 31746K  trying: MW2C90 - MWIZDK
guesses: 0  time: 0:00:00:11  c/s: 39202K  trying: N1BAKK - NGCHCY
guesses: 0  time: 0:00:00:17  c/s: 41378K  trying: 0BAYBBS - 0BLMSD1
guesses: 0  time: 0:00:00:25  c/s: 43049K  trying: 20RPPAD - 20MMNTS
guesses: 0  time: 0:00:00:36  c/s: 44219K  trying: ALY2AS8 - AR62899
NE1410S          (u3047)
guesses: 1  time: 0:00:00:48  c/s: 44756K  trying: NECDRD3 - NE10TWO

Yes, it's 47% faster, which is consistent with "--test" benchmarks. Those reported more like a 68% speedup, but we know that performance with OpenMP is less stable (it's very sensitive to system load) and that there's cracking-mode-specific overhead (not parallelized), which is not included in the "--test" benchmarks. So it is not surprising to see a difference like this for this very fast hash type. For slower hash types, "--test" benchmarks match actual runs much more closely.

BTW, the specific password "NE1410S" would normally be cracked instantly (given its LM hash) because it is found on JtR's bundled common passwords list, but I intentionally forced JtR to use the "incremental" mode to have this session take a little while. I did not even give JtR any hint on the character set used.

Finally, some quick benchmarks for dual Xeon X5460 3.16 GHz (briefly mentioned above), under some unrelated load.

Just the new key setup (no OpenMP), only one CPU core in use by John:

Benchmarking: Traditional DES [128/128 BS SSE2-16]... DONE
Many salts:     2982K c/s real, 2982K c/s virtual
Only one salt:  2880K c/s real, 2880K c/s virtual

Benchmarking: LM DES [128/128 BS SSE2-16]... DONE
Raw:    45243K c/s real, 45243K c/s virtual

1.7.6-omp-des-7:

Benchmarking: Traditional DES [128/128 BS SSE2-16]... DONE
Many salts:     20054K c/s real, 2506K c/s virtual
Only one salt:  15532K c/s real, 1941K c/s virtual

That's over 15M c/s for the single-salt case. Quite good.

Benchmarking: LM DES [128/128 BS SSE2-16]... DONE
Raw:    63799K c/s real, 7974K c/s virtual

Good speed, but nothing exciting now that we've already seen the Core i7 perform a bit better. I think it's the system load (in combination with this very fast hash) that prevented the Xeon from showing a better result. This is as little as 18% of what this same system achieves with 8 separate non-OpenMP-enabled processes (64M out of 356M). It should be possible to achieve a better result on an idle dual-Xeon system - 100M c/s anyone?

As usual, feedback is welcome. I'd like this stuff to become a bit more mature before I approach integrating it.

Alexander