Date: Mon, 28 Jun 2010 16:16:47 +0400 From: Solar Designer <solar@...nwall.com> To: john-users@...ts.openwall.com Subject: Re: bitslice DES parallelization with OpenMP Hi, You can follow Openwall on Twitter now - announcements only, very low volume - http://twitter.com/Openwall - and please retweet those whenever you can. :-) There are also several RSS feeds to choose from specified on the John the Ripper homepage - http://www.openwall.com/john/ - a feed from Twitter and several from Gmane. Now to the actual stuff: I've identified an oversight in the 1.7.6-omp-des-2 patch - all threads were using the "ones" variable from thread slot 0, which was too close to that slot's written-to fields (same page). I could correct this to use per-thread "ones", but doing so was tricky in the existing code, where the "tp" pointer is local to the "crypt body" functions, whereas the "ones" variable is needed by S-box functions (in practice they're inlined into the "crypt bodies", but this does not help with variable scope at the source code level). So I've hacked around this differently in 1.7.6-omp-des-3 - by keeping this variable separate and at a distance from the thread slots. Hopefully, the cache associativity is sufficient, and it's at most 1 cache line to bounce anyway. The new patch is uploaded to the wiki: http://openwall.info/wiki/john/patches and I've also attached a diff against -omp-des-2 to this message. Another change is setting "Idle = N" instead of the old default of "Idle = Y". This makes John less sensitive to other server load, but unfortunately it also makes John less friendly to those other tasks. The speed has improved - see below. On Mon, Jun 28, 2010 at 02:44:02AM +0400, Solar Designer wrote: > Next test system, Core i7 920 2.67 GHz (quad-core, 2 logical CPUs per > core) with a little bit of other load (perhaps under 5%). Normally, it > does around 2600K c/s for multi-salt with one non-parallelized process, > 11500K c/s combined with 8 separate processes. > > Benchmarking: Traditional DES [128/128 BS SSE2-16]... DONE > Many salts: 9802K c/s real, 1237K c/s virtual > Only one salt: 4683K c/s real, 585472 c/s virtual Same system, new benchmark: Benchmarking: Traditional DES [128/128 BS SSE2-16]... DONE Many salts: 10025K c/s real, 1251K c/s virtual Only one salt: 4786K c/s real, 597525 c/s virtual This is due to the "ones" change only. (The "Idle" setting has no effect on "--test" benchmarks. It only comes into play during actual John runs.) > guesses: 14 time: 0:00:00:34 c/s: 9491K trying: maiemaie - mbvnmbvn We're getting better actual speed now as well: host!solar:~/john/john-1.7.6-omp-des-3/run$ ./john -e=double --salts=-2 ~/john/pw-fake-unix Loaded 1458 password hashes with 1458 different salts (Traditional DES [128/128 BS SSE2-16]) mimi (u3044-des) aaaa (u1638-des) xxxx (u845-des) aaaaaa (u156-des) bebe (u1731-des) gigi (u2082-des) jojo (u3027-des) lulu (u3034-des) booboo (u171-des) cloclo (u2989-des) cccccc (u982-des) guesses: 11 time: 0:00:00:00 c/s: 9183K trying: iciici - jprjpr jamjam (u2207-des) simsim (u2671-des) ssssss (u3087-des) guesses: 14 time: 0:00:00:03 c/s: 9777K trying: amkaamka - anxjanxj guesses: 14 time: 0:00:00:08 c/s: 9884K trying: clxcclxc - cnklcnkl guesses: 14 time: 0:00:00:24 c/s: 9930K trying: irzgirzg - itmpitmp guesses: 14 time: 0:00:00:38 c/s: 9924K trying: oejkoejk - ofwtofwt woofwoof (u1435-des) guesses: 15 time: 0:00:01:09 c/s: 9838K trying: zzwmzzwm - zzzzzzzz > Finally, a faster system - dual Xeon X5460 (8 cores total) at 3.16 GHz, > but with a bit more load (around 10%). The benchmarks vary a lot: The benchmarks still vary a lot (the load is changing), so I'll only provide a test run output (under load). Previously, I got: > guesses: 15 time: 0:00:01:23 c/s: 8203K trying: zzwmzzwm - zzzzzzzz This time, things are a lot better: host!solar:~/john$ ./john-omp-des-3 -e=double --salts=-2 pw-fake-unix Loaded 1458 password hashes with 1458 different salts (Traditional DES [128/128 BS SSE2-16]) mimi (u3044-des) aaaa (u1638-des) xxxx (u845-des) aaaaaa (u156-des) bebe (u1731-des) gigi (u2082-des) jojo (u3027-des) lulu (u3034-des) booboo (u171-des) cloclo (u2989-des) cccccc (u982-des) jamjam (u2207-des) guesses: 12 time: 0:00:00:00 c/s: 16384K trying: odwodw - prfprf simsim (u2671-des) ssssss (u3087-des) guesses: 14 time: 0:00:00:03 c/s: 15406K trying: bjcubjcu - bkqdbkqd guesses: 14 time: 0:00:00:07 c/s: 14431K trying: dlqqdlqq - dndzdndz guesses: 14 time: 0:00:00:12 c/s: 14441K trying: ghwmghwm - gjjvgjjv guesses: 14 time: 0:00:00:17 c/s: 14526K trying: jfpsjfps - jhdbjhdb guesses: 14 time: 0:00:00:25 c/s: 14642K trying: nyhwnyhw - nzvfnzvf woofwoof (u1435-des) guesses: 15 time: 0:00:00:46 c/s: 14722K trying: zzwmzzwm - zzzzzzzz In this case, I think the speedup is mostly due to "Idle = N", which makes context switches less frequent (when there's other demand for CPU time) and thereby keeps the threads almost in sync more often. Enjoy, and don't forget to provide your feedback. Alexander P.S. Setting DES_BS_EXPAND_MERGED to 1 provides some speedup for the single-salt case (as expected, because the "expand" step gets parallelized then), but it slows things down a bit for the multi-salt case (at least on the Core i7 system). The latter is unexpected - if anyone can figure out the cause of the slowdown (or show that it doesn't occur on other builds/systems maybe?), please let me know. View attachment "john-1.7.6-omp-des-2to3.diff" of type "text/plain" (2344 bytes)
Powered by blists - more mailing lists
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.