Date: Wed, 30 Jun 2010 23:46:32 +0400
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Subject: Re: bitslice DES parallelization with OpenMP

On Tue, Jun 29, 2010 at 10:12:15PM -0500, Joshua J. Drake wrote:
> There seems to be a huge slow down when I try with a few cores under
> high load.. I was actually getting worse performance than running a
> single instance by itself.

Confirmed, and this is not limited to the DES code, nor to JtR. This
turns out to be a known problem with libgomp:

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43706

The fixes committed with reference to this bug do not actually fix the
problem (they fix another related problem), as correctly pointed out by
the original reporters.

Setting any one of the following environment variables - or a
combination - is a workaround that brings performance back to a
reasonable level (not great, just "reasonable") despite other
program(s) busily running on some CPUs:

GOMP_SPINCOUNT=10000     # When waiting, don't spin for too long
GOMP_CPU_AFFINITY=0-99   # Forcibly bind threads to CPUs sequentially
OMP_WAIT_POLICY=PASSIVE  # Avoid spinning
OMP_NUM_THREADS=7        # On an 8-core system, voluntarily only use 7 threads

Which one of these (or which combination) produces the best results
varies across systems, kinds of other load, and JtR invocation.

Overall, OpenMP behaves poorly when there's other load on the system.
I'll keep trying to make the code less sensitive to other load, but at
a later point I expect to need to introduce another parallelization
approach.

Curiously, setting GOMP_SPINCOUNT=10000 on a system with SMT also
significantly improves speed for the single-salt case even with almost
no other load. On the Core i7 system (the same one I posted benchmarks
for previously), I am getting:

host!solar:~/john/john-1.7.6-omp-des/run$ GOMP_SPINCOUNT=10000 ./john -te=1 -fo=des
Benchmarking: Traditional DES [128/128 BS SSE2-16]... DONE
Many salts:     10174K c/s real, 1270K c/s virtual
Only one salt:  6045K c/s real, 1149K c/s virtual

The old benchmark was (without GOMP_SPINCOUNT override):

Benchmarking: Traditional DES [128/128 BS SSE2-16]... DONE
Many salts:     10174K c/s real, 1267K c/s virtual
Only one salt:  4841K c/s real, 602923 c/s virtual

Both are currently reproducible (multiple times).

I think the speedup with lower GOMP_SPINCOUNT is attributable to Core
i7's SMT (two threads per core), where having a waiting thread yield the
CPU actually frees CPU resources up for use by other threads. This is
confirmed by OMP_WAIT_POLICY=PASSIVE:

host!solar:~/john/john-1.7.6-omp-des/run$ OMP_WAIT_POLICY=PASSIVE ./john -te=1 -fo=des
Benchmarking: Traditional DES [128/128 BS SSE2-16]... DONE
Many salts:     7962K c/s real, 1731K c/s virtual
Only one salt:  5182K c/s real, 1279K c/s virtual

Notice how the "c/s virtual" of 1279K is similar to that for the
GOMP_SPINCOUNT=10000 run. This must be what it becomes when other
threads are not waiting busily. However, only GOMP_SPINCOUNT=10000
provides an overall speedup, meaning that going for passive waits is not
always optimal.

Thank you for testing and for your feedback!

Alexander
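
Since the best workaround varies by system and load, the four settings
listed above can be tried one after another with a small shell loop. This
is only a sketch: the function name try_settings is made up for
illustration, and the values are simply the ones quoted in this message
(OMP_NUM_THREADS=7 assumes an 8-core system, as noted above).

```shell
#!/bin/sh
# Hypothetical helper: run the given command once per libgomp workaround,
# so the resulting c/s figures can be compared by eye.
try_settings() {
    for setting in \
        "GOMP_SPINCOUNT=10000" \
        "GOMP_CPU_AFFINITY=0-99" \
        "OMP_WAIT_POLICY=PASSIVE" \
        "OMP_NUM_THREADS=7"
    do
        echo "== $setting =="
        # env applies the variable to this one run only
        env "$setting" "$@"
    done
}
```

For example, "try_settings ./john -te=1 -fo=des" repeats the benchmark
from this message under each setting. Combinations are not covered by the
loop and would need to be tried separately.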