Date: Mon, 28 Jun 2010 02:44:02 +0400
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Subject: bitslice DES parallelization with OpenMP

Hi,

I'll start with a benchmark result to get you to read the rest of this
lengthy message. ;-)

Benchmarking: Traditional DES [128/128 BS SSE2-16]... DONE
Many salts:     17444K c/s real, 2202K c/s virtual
Only one salt:  5561K c/s real, 694300 c/s virtual

Attached to this message and uploaded to the wiki is a quick and really
dirty patch, which is nevertheless a successful attempt at parallelizing
John the Ripper's bitslice DES code with OpenMP directives (requires gcc
4.2+ or the like).  So far, I've only tested this with gcc 4.5.0 and the
linux-x86-64 make target.  The current revision is 1.7.6-omp-des-2.

http://openwall.info/wiki/john/patches
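
To illustrate the general shape of the change (this is a simplified sketch, not the code from the patch): the single global bitslice state becomes an array with one element per thread, a candidate's global index selects the element and the position within it, and the main computation loops over the elements under an OpenMP directive.  The names slot_state, slots, N_SLOTS, SLOT_DEPTH, set_candidate and crypt_all are made up for this example, and the arithmetic in the loop is merely a stand-in for DES.

```c
#define N_SLOTS 8	/* assumption: plays the role of DES_bs_mt */
#define SLOT_DEPTH 128	/* assumption: plays the role of DES_BS_DEPTH */

/* Per-thread working set: one instance per slot, so that threads never
 * write to each other's state */
typedef struct {
	unsigned int block[SLOT_DEPTH];
} slot_state;

slot_state slots[N_SLOTS];

/* A global candidate index selects a slot (index / depth) and a position
 * within that slot (index % depth) */
void set_candidate(int index, unsigned int value)
{
	slots[index / SLOT_DEPTH].block[index % SLOT_DEPTH] = value;
}

/* Toy stand-in for the hash computation over all buffered candidates */
void crypt_all(int count)
{
	int t;

#ifdef _OPENMP
#pragma omp parallel for default(shared)
#endif
	for (t = 0; t < N_SLOTS; t++) {
		slot_state *tp = &slots[t];
		int i, n;

		for (n = 0; n < count; n++)
			for (i = 0; i < SLOT_DEPTH; i++)
				tp->block[i] = tp->block[i] * 2654435761u + 1;
	}
	/* With OpenMP, all threads are waited for here before returning */
}
```

Built without OpenMP support, the directive is skipped and the loop simply runs sequentially, with the same results.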

This patch provides good performance for traditional DES-based crypt(3)
hashes in the multi-salt case, and an even better speedup (relative to
the non-patched code) for BSDI-style crypt(3) hashes (since those are
slower), but it usually does not provide a speedup for LM hashes (too
much overhead, key setup is not parallelized, and the ordering of
candidate passwords is non-optimal for the underlying key setup
algorithm).

Another drawback of the OpenMP approach (or of any other "synchronous"
approach) is that efficiency drops when there's other load on the system.
While on an idle dual-core laptop I am getting 90% efficiency for
traditional DES-based crypt(3) hashes with multiple salts, the
efficiency drops to between 70% and 85% on 8-core servers with almost no
load, and to under 50% when the JtR-unrelated server load is at around 10%.
(These efficiency percentages are relative to combined c/s rates
possible with multiple separate processes run on the same systems.)
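
To make the definition concrete, here is a small sketch of the calculation in C.  The function name is made up for this example; the inputs in the comment are figures quoted elsewhere in this message.

```c
/*
 * Efficiency of one OpenMP-enabled process relative to the combined
 * c/s rate of separate non-parallelized processes on the same system.
 * Hypothetical helper, not part of JtR.
 */
double omp_efficiency(double omp_cps, double combined_cps)
{
	return omp_cps / combined_cps;
}

/*
 * Examples (c/s rates in thousands):
 *   omp_efficiency(3099, 2 * 1713)   Core 2 Duo T7100, two cores, idle
 *   omp_efficiency(9802, 11500)      Core i7 920, light load
 */
```

With these inputs the results come out at about 0.90 and 0.85, respectively.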

So this is for use on otherwise idle systems only.

(The OpenMP parallelization code for slower hashes that is integrated
into JtR 1.7.6 is a lot less sensitive to system load, likely because
the synchronization is less frequent.  Maybe the same could be achieved
by buffering a lot more candidate passwords with these faster hashes -
beyond the current setting of 1024 - although the candidates would not
fit in L1 data cache then, likely resulting in a minor slowdown when
there's no other load.)
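
A rough way to see how frequent the synchronization is: assuming one parallel region (and thus one point where all threads must be waited for) per crypt call over the whole buffer, which is how this patch appears to work in the multi-salt case, the number of such points per second is simply the c/s rate divided by the buffer size.  The helper names below are made up for this sketch.

```c
/* Parallel regions entered per second, assuming one region per crypt
 * call and keys_per_crypt candidates hashed (for one salt) per call */
double sync_points_per_sec(double cps, unsigned int keys_per_crypt)
{
	return cps / keys_per_crypt;
}

/* Average time between synchronization points, in microseconds */
double usec_between_syncs(double cps, unsigned int keys_per_crypt)
{
	return 1e6 / sync_points_per_sec(cps, keys_per_crypt);
}
```

For example, at 17444K c/s with 1024 candidates buffered this gives roughly 17000 synchronization points per second, or one about every 59 microseconds, which leaves little room for a thread to be delayed by other load without stalling the rest.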

Below are some benchmark results.  These are to be compared against
non-parallelized single process benchmark results on the wiki:

http://openwall.info/wiki/john/benchmarks

Without parallelization, my Core 2 Duo T7100 1.8 GHz laptop does (using
just one CPU core at a time):

Benchmarking: Traditional DES [128/128 BS SSE2-16]... DONE
Many salts:     1713K c/s real, 1713K c/s virtual
Only one salt:  1452K c/s real, 1467K c/s virtual

Benchmarking: BSDI DES (x725) [128/128 BS SSE2-16]... DONE
Many salts:     55680 c/s real, 55680 c/s virtual
Only one salt:  54144 c/s real, 54144 c/s virtual

With 1.7.6-omp-des-2 built with gcc 4.5.0, it does:

Benchmarking: Traditional DES [128/128 BS SSE2-16]... DONE
Many salts:     3099K c/s real, 1549K c/s virtual
Only one salt:  2302K c/s real, 1151K c/s virtual

Benchmarking: BSDI DES (x725) [128/128 BS SSE2-16]... DONE
Many salts:     97280 c/s real, 48398 c/s virtual
Only one salt:  92261 c/s real, 46360 c/s virtual

3099K is roughly 90% of 2x1713K.  Let's see if this is achieved in
practice, intentionally avoiding any matching salts for now:

host!solar:~/john$ ./john-omp-des-2 -e=double --salts=-2 pw-fake-unix
Loaded 1458 password hashes with 1458 different salts (Traditional DES [128/128 BS SSE2-16])
mimi             (u3044-des)
aaaa             (u1638-des)
xxxx             (u845-des)
aaaaaa           (u156-des)
bebe             (u1731-des)
gigi             (u2082-des)
jojo             (u3027-des)
lulu             (u3034-des)
booboo           (u171-des)
cloclo           (u2989-des)
cccccc           (u982-des)
guesses: 11  time: 0:00:00:02  c/s: 3073K  trying: iciici - jprjpr
jamjam           (u2207-des)
guesses: 12  time: 0:00:00:04  c/s: 3080K  trying: odwodw - prfprf
simsim           (u2671-des)
ssssss           (u3087-des)
guesses: 14  time: 0:00:00:13  c/s: 3086K  trying: aqyeaqye - aslnasln

Yes, it is.

Now let's try some quad-core and 8-core servers under low load.

On a Q6600 2.4 GHz, which normally does 2300K to 2400K c/s for traditional
DES-based crypt(3) in the multi-salt case for a single process (using
just one CPU core) with current versions of JtR under no other load, an
OpenMP-enabled build of 1.7.6-omp-des-2 achieves:

Benchmarking: Traditional DES [128/128 BS SSE2-16]... DONE
Many salts:     8062K c/s real, 2005K c/s virtual
Only one salt:  4558K c/s real, 1136K c/s virtual

Benchmarking: BSDI DES (x725) [128/128 BS SSE2-16]... DONE
Many salts:     259072 c/s real, 64445 c/s virtual
Only one salt:  236544 c/s real, 58695 c/s virtual

That's around 85% efficiency (and a lot worse than that for the
single-salt case, but the code was not supposed to perform very well for
that case).  Let's confirm this:

host!solar:~/john$ ./john-omp-des-2 -e=double --salts=-2 pw-fake-unix
Loaded 1458 password hashes with 1458 different salts (Traditional DES [128/128 BS SSE2-16])
mimi             (u3044-des)
aaaa             (u1638-des)
xxxx             (u845-des)
aaaaaa           (u156-des)
bebe             (u1731-des)
gigi             (u2082-des)
jojo             (u3027-des)
lulu             (u3034-des)
booboo           (u171-des)
cloclo           (u2989-des)
cccccc           (u982-des)
jamjam           (u2207-des)
guesses: 12  time: 0:00:00:01  c/s: 6645K  trying: ldcldc - mqlmql
simsim           (u2671-des)
ssssss           (u3087-des)
guesses: 14  time: 0:00:00:05  c/s: 4858K  trying: abuiabui - adhradhr
guesses: 14  time: 0:00:00:11  c/s: 6157K  trying: bvfwbvfw - bwtfbwtf
guesses: 14  time: 0:00:00:21  c/s: 6726K  trying: eszceszc - eumleuml
guesses: 14  time: 0:00:00:27  c/s: 6673K  trying: ghwmghwm - gjjvgjjv
guesses: 14  time: 0:00:00:35  c/s: 6842K  trying: iqlwiqlw - irzfirzf
guesses: 14  time: 0:00:01:02  c/s: 7095K  trying: qomeqome - qpznqpzn
woofwoof         (u1435-des)
guesses: 15  time: 0:00:01:36  c/s: 7118K  trying: zzwmzzwm - zzzzzzzz

Well, it's a bit worse in practice.  Perhaps I would have captured a
result like this (or worse) if I simply re-ran the "--test" benchmark or
if I let "--test" run for longer.  There's some load on the server, but
not enough to cause such a performance hit on JtR "directly".  Rather,
it does this by causing JtR to wait for termination of all of its
threads when it temporarily switches from parallel to sequential
execution, which it does very often (with this patch).
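
A toy model of this effect, under the assumption that each parallel region ends only when its slowest thread finishes (as is the case with the implicit barrier at the end of an OpenMP "parallel for"): the wall-clock time of a region is the maximum of the per-thread finish times, so a delay to any one thread is paid by all of them.  The function names are made up for this sketch.

```c
/* Wall-clock time of one parallel region: the slowest thread decides */
double region_time(const double *finish_time, int nthreads)
{
	double max = 0.0;
	int i;

	for (i = 0; i < nthreads; i++)
		if (finish_time[i] > max)
			max = finish_time[i];
	return max;
}

/* Efficiency of the region when every thread has the same amount of
 * work to do (in the same time units as finish_time) */
double region_efficiency(double work, const double *finish_time, int nthreads)
{
	return work / region_time(finish_time, nthreads);
}
```

With four threads that each have 1.0 time units of work, where one of them is preempted for another 1.0 units, the region takes 2.0 units and the efficiency is 50%, even though the other load consumed only 1.0 of the 8.0 available CPU time units (12.5%).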

That's roughly 75% of the full speed of this CPU (achievable with four
separate JtR processes on an idle system), whereas the server load was
directly responsible for maybe 5% of the CPU time.  Another 10% was
lost to the thread-safety overhead of the code in this build, and
presumably at least another 10% to thread synchronization (an indirect
performance hit from other load).

The next test system is a Core i7 920 2.67 GHz (quad-core, 2 logical
CPUs per core) with a little bit of other load (perhaps under 5%).
Normally, it does around 2600K c/s for multi-salt with one
non-parallelized process, and 11500K c/s combined with 8 separate
processes.

Benchmarking: Traditional DES [128/128 BS SSE2-16]... DONE
Many salts:     9802K c/s real, 1237K c/s virtual
Only one salt:  4683K c/s real, 585472 c/s virtual

Benchmarking: BSDI DES (x725) [128/128 BS SSE2-16]... DONE
Many salts:     324608 c/s real, 40576 c/s virtual
Only one salt:  285696 c/s real, 35712 c/s virtual

Again, roughly 85% efficiency (9800K of 11500K).  Let's confirm:

host!solar:~/john/john-1.7.6-omp-des-2/run$ ./john -e=double --salts=-2 ~/john/pw-fake-unix
Loaded 1458 password hashes with 1458 different salts (Traditional DES [128/128 BS SSE2-16])
mimi             (u3044-des)
aaaa             (u1638-des)
xxxx             (u845-des)
aaaaaa           (u156-des)
bebe             (u1731-des)
gigi             (u2082-des)
jojo             (u3027-des)
lulu             (u3034-des)
booboo           (u171-des)
cloclo           (u2989-des)
cccccc           (u982-des)
jamjam           (u2207-des)
guesses: 12  time: 0:00:00:01  c/s: 6952K  trying: mqmmqm - odvodv
simsim           (u2671-des)
ssssss           (u3087-des)
guesses: 14  time: 0:00:00:09  c/s: 9121K  trying: cnkmcnkm - coxvcoxv
guesses: 14  time: 0:00:00:24  c/s: 9412K  trying: ieiuieiu - ifwdifwd
guesses: 14  time: 0:00:00:34  c/s: 9491K  trying: maiemaie - mbvnmbvn

This looks about right, 82% of full speed on this test.  Maybe the load
changed a bit.

Finally, a faster system - dual Xeon X5460 (8 cores total) at 3.16 GHz,
but with a bit more load (around 10%).  The benchmarks vary a lot:

Benchmarking: Traditional DES [128/128 BS SSE2-16]... DONE
Many salts:     17632K c/s real, 2204K c/s virtual
Only one salt:  3697K c/s real, 480216 c/s virtual

Benchmarking: Traditional DES [128/128 BS SSE2-16]... DONE
Many salts:     11803K c/s real, 1526K c/s virtual
Only one salt:  5240K c/s real, 655923 c/s virtual

Benchmarking: Traditional DES [128/128 BS SSE2-16]... DONE
Many salts:     11730K c/s real, 1519K c/s virtual
Only one salt:  5500K c/s real, 693533 c/s virtual

Benchmarking: Traditional DES [128/128 BS SSE2-16]... DONE
Many salts:     12247K c/s real, 1582K c/s virtual
Only one salt:  5326K c/s real, 668362 c/s virtual

Benchmarking: Traditional DES [128/128 BS SSE2-16]... DONE
Many salts:     17444K c/s real, 2202K c/s virtual
Only one salt:  5561K c/s real, 694300 c/s virtual

Benchmarking: Traditional DES [128/128 BS SSE2-16]... DONE
Many salts:     13569K c/s real, 1741K c/s virtual
Only one salt:  5575K c/s real, 696089 c/s virtual

A few were totally off - showing a speed of less than 1M c/s (for a
one-second interval), even though the load was never very high.  This
code is really sensitive to any other load.

Normally, this system would do around 3000K c/s per core for multi-salt,
or up to 24000K c/s total.  So the best measured speed is only 73% of
that.  Yet it is an impressive number, which compares favorably against
that for an FPGA implementation:

http://www.sump.org/projects/password/

(and JtR lacks the many limitations of that implementation).

For completeness' sake, here are some BSDI-style crypt(3) benchmarks on
this system:

Benchmarking: BSDI DES (x725) [128/128 BS SSE2-16]... DONE
Many salts:     660480 c/s real, 82456 c/s virtual
Only one salt:  395264 c/s real, 51266 c/s virtual

Benchmarking: BSDI DES (x725) [128/128 BS SSE2-16]... DONE
Many salts:     535552 c/s real, 68837 c/s virtual
Only one salt:  493568 c/s real, 62556 c/s virtual

As expected, these were less sensitive to load (a slower hash).

Let's test:

host!solar:~/john$ ./john-omp-des-2 -e=double --salts=-2 pw-fake-unix
Loaded 1458 password hashes with 1458 different salts (Traditional DES [128/128 BS SSE2-16])
mimi             (u3044-des)
aaaa             (u1638-des)
xxxx             (u845-des)
aaaaaa           (u156-des)
bebe             (u1731-des)
gigi             (u2082-des)
jojo             (u3027-des)
lulu             (u3034-des)
booboo           (u171-des)
cloclo           (u2989-des)
cccccc           (u982-des)
jamjam           (u2207-des)
guesses: 12  time: 0:00:00:00  c/s: 10624K  trying: jpsjps - ldbldb
simsim           (u2671-des)
ssssss           (u3087-des)
guesses: 14  time: 0:00:00:04  c/s: 10836K  trying: bbnwbbnw - bdbfbdbf
guesses: 14  time: 0:00:00:25  c/s: 8989K  trying: icvkicvk - ieitieit
guesses: 14  time: 0:00:01:10  c/s: 7275K  trying: thrgthrg - tjeptjep
woofwoof         (u1435-des)
guesses: 15  time: 0:00:01:23  c/s: 8203K  trying: zzwmzzwm - zzzzzzzz

I actually let it complete this time.  Well, the speed is down to 47% of
what we saw on the best benchmark run, perhaps because the load was
changing (which was seen between the benchmark runs as well).  Looking at
it another way, this is only 34% of what this system could have achieved
with 8 separate processes and if it were 100% idle.  So the 10% load (my
guesstimate based on looking at "top", etc.) and our desire to simplify
the JtR invocation (just one 8-thread process instead of 8 processes)
and usage (just one .rec file, with one entry in it) combined have cost
us a 3x performance hit in this case...

Overall, I think the approach is usable and reasonable in enough cases
and its added code complexity is low enough for me to integrate
something like it, but it shouldn't be the only way to parallelize JtR.

I am going to try out other approaches later, if time permits.  Two of
the ideas are a specific hybrid process + thread model and partially
asynchronous interfaces (to enable new candidate passwords to be
generated while hashes for the previous bunch are still being computed).

I also haven't given up on the idea of parallelizing JtR in a generic
and scalable way, with separate candidate password streams - somewhat
like what the MPI patch does, but with better usability and reliability
(and without a dependency on MPI).  I am just keeping this on hold and
implementing whatever I can do quickly instead, also considering that
these OpenMP patches and the like may be beneficial under a distributed
model as well, in cases when good efficiency is achieved (fewer nodes to
manage "from the console" if one node is a machine rather than an
individual logical CPU).

As usual, any feedback is welcome.

Alexander

diff -urp john-1.7.6.orig/src/BSDI_fmt.c john-1.7.6-omp-des-2/src/BSDI_fmt.c
--- john-1.7.6.orig/src/BSDI_fmt.c	2010-01-16 17:13:35 +0000
+++ john-1.7.6-omp-des-2/src/BSDI_fmt.c	2010-06-27 16:27:58 +0000
@@ -45,8 +45,8 @@ static struct fmt_tests tests[] = {
 #define BINARY_SIZE			ARCH_SIZE
 #define SALT_SIZE			(ARCH_SIZE * 2)
 
-#define MIN_KEYS_PER_CRYPT		DES_BS_DEPTH
-#define MAX_KEYS_PER_CRYPT		DES_BS_DEPTH
+#define MIN_KEYS_PER_CRYPT		(DES_BS_DEPTH * DES_bs_mt)
+#define MAX_KEYS_PER_CRYPT		(DES_BS_DEPTH * DES_bs_mt)
 
 #else
 
diff -urp john-1.7.6.orig/src/DES_bs.c john-1.7.6-omp-des-2/src/DES_bs.c
--- john-1.7.6.orig/src/DES_bs.c	2010-06-09 14:54:26 +0000
+++ john-1.7.6-omp-des-2/src/DES_bs.c	2010-06-27 17:10:52 +0000
@@ -3,7 +3,13 @@
  * Copyright (c) 1996-2002,2005,2010 by Solar Designer
  */
 
+/* cmp_all() does too little work per call to be worth the overhead */
+#undef CMP_ALL_PARA
+
 #include <string.h>
+#ifdef CMP_ALL_PARA
+#include <signal.h> /* for sig_atomic_t */
+#endif
 
 #include "arch.h"
 #include "common.h"
@@ -14,7 +20,10 @@
 #define DEPTH				[depth]
 #define START				[0]
 #define init_depth() \
+	DES_bs_combined *tp; \
 	int depth; \
+	tp = &DES_bs_all[index / DES_BS_DEPTH]; \
+	index %= DES_BS_DEPTH; \
 	depth = index >> ARCH_BITS_LOG; \
 	index &= (ARCH_BITS - 1);
 #define for_each_depth() \
@@ -22,12 +31,13 @@
 #else
 #define DEPTH
 #define START
-#define init_depth()
+#define init_depth() \
+	DES_bs_combined *tp = &DES_bs_all[index / DES_BS_DEPTH];
 #define for_each_depth()
 #endif
 
 #if !DES_BS_ASM
-DES_bs_combined CC_CACHE_ALIGN DES_bs_all;
+DES_bs_combined CC_CACHE_ALIGN DES_bs_all[DES_bs_mt];
 #endif
 
 static unsigned char DES_LM_KP[56] = {
@@ -55,20 +65,24 @@ void DES_bs_init(int LM)
 	int round, index, bit;
 	int p, q, s;
 	int c;
+	DES_bs_combined *tp;
 
-	DES_bs_all.KS_updates = 0;
+	for_each_t()
+		tp->KS_updates = 0;
 	if (LM)
 		DES_bs_clear_keys_LM();
 	else
 		DES_bs_clear_keys();
 
+for_each_t() {
+
 #if DES_BS_EXPAND
 	if (LM)
-		k = DES_bs_all.KS.p;
+		k = tp->KS.p;
 	else
-		k = DES_bs_all.KSp;
+		k = tp->KSp;
 #else
-	k = DES_bs_all.KS.p;
+	k = tp->KS.p;
 #endif
 
 	s = 0;
@@ -84,7 +98,7 @@ void DES_bs_init(int LM)
 			bit -= bit >> 3;
 			bit = 55 - bit;
 			if (LM) bit = DES_LM_KP[bit];
-			*k++ = &DES_bs_all.K[bit] START;
+			*k++ = &tp->K[bit] START;
 		}
 	}
 
@@ -92,42 +106,46 @@ void DES_bs_init(int LM)
  * non-zero bit in the index. */
 	for (bit = 0; bit <= 7; bit++)
 	for (index = 1 << bit; index < 0x100; index += 2 << bit)
-		DES_bs_all.s1[index] = bit + 1;
+		tp->s1[index] = bit + 1;
 
 /* Special case: instead of doing an extra check in *_set_key*(), we
  * might overrun into DES_bs_all.B, which is harmless as long as the
  * order of fields is unchanged.  57 is the smallest value to guarantee
  * we'd be past the end of K[] since we start at -1. */
-	DES_bs_all.s1[0] = 57;
+	tp->s1[0] = 57;
 
 /* The same for second bits */
 	for (index = 0; index < 0x100; index++) {
-		bit = DES_bs_all.s1[index];
-		bit += DES_bs_all.s1[index >> bit];
-		DES_bs_all.s2[index] = (bit <= 8) ? bit : 57;
+		bit = tp->s1[index];
+		bit += tp->s1[index >> bit];
+		tp->s2[index] = (bit <= 8) ? bit : 57;
 	}
 
 /* Convert to byte offsets */
 	for (index = 0; index < 0x100; index++)
-		DES_bs_all.s1[index] *= sizeof(DES_bs_vector);
+		tp->s1[index] *= sizeof(DES_bs_vector);
 
 	if (LM) {
 		for (c = 0; c < 0x100; c++)
 		if (c >= 'a' && c <= 'z')
-			DES_bs_all.E.extras.u[c] = c & ~0x20;
+			tp->E.extras.u[c] = c & ~0x20;
 		else
-			DES_bs_all.E.extras.u[c] = c;
+			tp->E.extras.u[c] = c;
 	}
 
 #if DES_BS_ASM
 	DES_bs_init_asm();
 #elif defined(__MMX__) || defined(__SSE2__)
-	memset(&DES_bs_all.ones, -1, sizeof(DES_bs_all.ones));
+	memset(tp->ones, -1, sizeof(tp->ones));
 #endif
+} /*t*/
 }
 
 void DES_bs_set_salt(ARCH_WORD salt)
 {
+	DES_bs_combined *tp;
+
+for_each_t() {
 	ARCH_WORD mask;
 	int src, dst;
 
@@ -139,48 +157,59 @@ void DES_bs_set_salt(ARCH_WORD salt)
 			if (dst < 24) src = dst + 24; else src = dst - 24;
 		} else src = dst;
 
-		DES_bs_all.E.E[dst] = &DES_bs_all.B[DES_E[src]] START;
-		DES_bs_all.E.E[dst + 48] = &DES_bs_all.B[DES_E[src] + 32] START;
+		tp->E.E[dst] = &tp->B[DES_E[src]] START;
+		tp->E.E[dst + 48] = &tp->B[DES_E[src] + 32] START;
 
 		mask <<= 1;
 	}
 }
+}
 
 void DES_bs_clear_keys(void)
 {
-	if (DES_bs_all.KS_updates++ & 0xFFF) return;
-	DES_bs_all.KS_updates = 1;
-	memset(DES_bs_all.K, 0, sizeof(DES_bs_all.K));
-	memset(DES_bs_all.keys, 0, sizeof(DES_bs_all.keys));
-	DES_bs_all.keys_changed = 1;
+	DES_bs_combined *tp;
+
+for_each_t() {
+	if (tp->KS_updates++ & 0xFFF) continue;
+	tp->KS_updates = 1;
+	memset(tp->K, 0, sizeof(tp->K));
+	memset(tp->keys, 0, sizeof(tp->keys));
+	tp->keys_changed = 1;
+}
 }
 
 void DES_bs_clear_keys_LM(void)
 {
-	if (DES_bs_all.KS_updates++ & 0xFFF) return;
-	DES_bs_all.KS_updates = 1;
-	memset(DES_bs_all.K, 0, sizeof(DES_bs_all.K));
+	DES_bs_combined *tp;
+
+for_each_t() {
+	if (tp->KS_updates++ & 0xFFF) continue;
+	tp->KS_updates = 1;
+	memset(tp->K, 0, sizeof(tp->K));
 #if !DES_BS_VECTOR && ARCH_BITS >= 64
-	memset(DES_bs_all.E.extras.keys, 0, sizeof(DES_bs_all.E.extras.keys));
+	memset(tp->E.extras.keys, 0, sizeof(tp->E.extras.keys));
 #else
-	memset(DES_bs_all.keys, 0, sizeof(DES_bs_all.keys));
+	memset(tp->keys, 0, sizeof(tp->keys));
 #endif
 }
+}
 
 void DES_bs_set_key(char *key, int index)
 {
 /* new is NUL-terminated, but not NUL-padded to any length;
  * old is NUL-padded to 8 characters, but not always NUL-terminated. */
 	unsigned char *new = (unsigned char *)key;
-	unsigned char *old = DES_bs_all.keys[index];
+	unsigned char *old;
 	DES_bs_vector *k, *kbase;
 	ARCH_WORD mask;
 	unsigned int xor, s1, s2;
 
+	old = DES_bs_all[index / DES_BS_DEPTH].keys[index % DES_BS_DEPTH];
+
 	init_depth();
 
 	mask = (ARCH_WORD)1 << index;
-	k = (DES_bs_vector *)&DES_bs_all.K[0] DEPTH - 1;
+	k = (DES_bs_vector *)&tp->K[0] DEPTH - 1;
 #if ARCH_ALLOWS_UNALIGNED
 	if (*(ARCH_WORD_32 *)new == *(ARCH_WORD_32 *)old &&
 	    old[sizeof(ARCH_WORD_32)]) {
@@ -189,22 +218,22 @@ void DES_bs_set_key(char *key, int index
 		k += sizeof(ARCH_WORD_32) * 7;
 	}
 #endif
-	while (*new && k < &DES_bs_all.K[55]) {
+	while (*new && k < &tp->K[55]) {
 		kbase = k;
 		if ((xor = *new ^ *old)) {
 			xor &= 0x7F; /* Note: this might result in xor == 0 */
 			*old = *new;
 			do {
-				s1 = DES_bs_all.s1[xor];
-				s2 = DES_bs_all.s2[xor];
+				s1 = tp->s1[xor];
+				s2 = tp->s2[xor];
 				*(ARCH_WORD *)((char *)k + s1) ^= mask;
 				if (s2 > 8) break; /* Required for xor == 0 */
 				xor >>= s2;
 				k[s2] START ^= mask;
 				k += s2;
 				if (!xor) break;
-				s1 = DES_bs_all.s1[xor];
-				s2 = DES_bs_all.s2[xor];
+				s1 = tp->s1[xor];
+				s2 = tp->s2[xor];
 				xor >>= s2;
 				*(ARCH_WORD *)((char *)k + s1) ^= mask;
 				k[s2] START ^= mask;
@@ -217,21 +246,21 @@ void DES_bs_set_key(char *key, int index
 		k = kbase + 7;
 	}
 
-	while (*old && k < &DES_bs_all.K[55]) {
+	while (*old && k < &tp->K[55]) {
 		kbase = k;
 		xor = *old & 0x7F; /* Note: this might result in xor == 0 */
 		*old++ = 0;
 		do {
-			s1 = DES_bs_all.s1[xor];
-			s2 = DES_bs_all.s2[xor];
+			s1 = tp->s1[xor];
+			s2 = tp->s2[xor];
 			*(ARCH_WORD *)((char *)k + s1) ^= mask;
 			if (s2 > 8) break; /* Required for xor == 0 */
 			xor >>= s2;
 			k[s2] START ^= mask;
 			k += s2;
 			if (!xor) break;
-			s1 = DES_bs_all.s1[xor];
-			s2 = DES_bs_all.s2[xor];
+			s1 = tp->s1[xor];
+			s2 = tp->s2[xor];
 			xor >>= s2;
 			*(ARCH_WORD *)((char *)k + s1) ^= mask;
 			k[s2] START ^= mask;
@@ -241,7 +270,7 @@ void DES_bs_set_key(char *key, int index
 		k = kbase + 7;
 	}
 
-	DES_bs_all.keys_changed = 1;
+	tp->keys_changed = 1;
 }
 
 void DES_bs_set_key_LM(char *key, int index)
@@ -249,20 +278,22 @@ void DES_bs_set_key_LM(char *key, int in
 /* new is NUL-terminated, but not NUL-padded to any length;
  * old is NUL-padded to 7 characters and NUL-terminated. */
 	unsigned char *new = (unsigned char *)key;
-#if !DES_BS_VECTOR && ARCH_BITS >= 64
-	unsigned char *old = DES_bs_all.E.extras.keys[index];
-#else
-	unsigned char *old = DES_bs_all.keys[index];
-#endif
+	unsigned char *old;
 	DES_bs_vector *k, *kbase;
 	ARCH_WORD mask;
 	unsigned int xor, s1, s2;
 	unsigned char plain;
 
+#if !DES_BS_VECTOR && ARCH_BITS >= 64
+	old = DES_bs_all[index / DES_BS_DEPTH].E.extras.keys[index % DES_BS_DEPTH];
+#else
+	old = DES_bs_all[index / DES_BS_DEPTH].keys[index % DES_BS_DEPTH];
+#endif
+
 	init_depth();
 
 	mask = (ARCH_WORD)1 << index;
-	k = (DES_bs_vector *)&DES_bs_all.K[0] DEPTH - 1;
+	k = (DES_bs_vector *)&tp->K[0] DEPTH - 1;
 #if ARCH_ALLOWS_UNALIGNED
 	if (*(ARCH_WORD_32 *)new == *(ARCH_WORD_32 *)old &&
 	    old[sizeof(ARCH_WORD_32)]) {
@@ -271,21 +302,21 @@ void DES_bs_set_key_LM(char *key, int in
 		k += sizeof(ARCH_WORD_32) * 8;
 	}
 #endif
-	while (*new && k < &DES_bs_all.K[55]) {
-		plain = DES_bs_all.E.extras.u[ARCH_INDEX(*new)];
+	while (*new && k < &tp->K[55]) {
+		plain = tp->E.extras.u[ARCH_INDEX(*new)];
 		kbase = k;
 		if ((xor = plain ^ *old)) {
 			*old = plain;
 			do {
-				s1 = DES_bs_all.s1[xor];
-				s2 = DES_bs_all.s2[xor];
+				s1 = tp->s1[xor];
+				s2 = tp->s2[xor];
 				xor >>= s2;
 				*(ARCH_WORD *)((char *)k + s1) ^= mask;
 				k[s2] START ^= mask;
 				k += s2;
 				if (!xor) break;
-				s1 = DES_bs_all.s1[xor];
-				s2 = DES_bs_all.s2[xor];
+				s1 = tp->s1[xor];
+				s2 = tp->s2[xor];
 				xor >>= s2;
 				*(ARCH_WORD *)((char *)k + s1) ^= mask;
 				k[s2] START ^= mask;
@@ -303,15 +334,15 @@ void DES_bs_set_key_LM(char *key, int in
 		xor = *old;
 		*old++ = 0;
 		do {
-			s1 = DES_bs_all.s1[xor];
-			s2 = DES_bs_all.s2[xor];
+			s1 = tp->s1[xor];
+			s2 = tp->s2[xor];
 			xor >>= s2;
 			*(ARCH_WORD *)((char *)k + s1) ^= mask;
 			k[s2] START ^= mask;
 			k += s2;
 			if (!xor) break;
-			s1 = DES_bs_all.s1[xor];
-			s2 = DES_bs_all.s2[xor];
+			s1 = tp->s1[xor];
+			s2 = tp->s2[xor];
 			xor >>= s2;
 			*(ARCH_WORD *)((char *)k + s1) ^= mask;
 			k[s2] START ^= mask;
@@ -322,25 +353,34 @@ void DES_bs_set_key_LM(char *key, int in
 	}
 }
 
-#if DES_BS_EXPAND
+#if DES_BS_EXPAND && !DES_BS_EXPAND_MERGED
 void DES_bs_expand_keys(void)
 {
+	DES_bs_combined *tp;
+
+#if 0
+/* it may be unreasonable to parallelize this */
+#pragma omp parallel for default(shared) private(tp)
+#endif
+
+for_each_t() {
 	int index;
 #if DES_BS_VECTOR
 	int depth;
 #endif
 
-	if (!DES_bs_all.keys_changed) return;
+	if (!tp->keys_changed) continue;
 
 	for (index = 0; index < 0x300; index++)
 	for_each_depth()
 #if DES_BS_VECTOR
-		DES_bs_all.KS.v[index] DEPTH = DES_bs_all.KSp[index] DEPTH;
+		tp->KS.v[index] DEPTH = tp->KSp[index] DEPTH;
 #else
-		DES_bs_all.KS.v[index] = *DES_bs_all.KSp[index];
+		tp->KS.v[index] = *tp->KSp[index];
 #endif
 
-	DES_bs_all.keys_changed = 0;
+	tp->keys_changed = 0;
+}
 }
 #endif
 
@@ -387,7 +427,7 @@ int DES_bs_get_hash(int index, int count
 	DES_bs_vector *b;
 
 	init_depth();
-	b = (DES_bs_vector *)&DES_bs_all.B[0] DEPTH;
+	b = (DES_bs_vector *)&tp->B[0] DEPTH;
 
 	result = (b[0] START >> index) & 1;
 	result |= ((b[1] START >> index) & 1) << 1;
@@ -428,6 +468,13 @@ int DES_bs_get_hash(int index, int count
  */
 int DES_bs_cmp_all(ARCH_WORD *binary)
 {
+	DES_bs_combined *tp;
+#ifdef CMP_ALL_PARA
+	sig_atomic_t retval = 0;
+
+#pragma omp parallel for default(shared) private(tp)
+#endif
+for_each_t() {
 	ARCH_WORD value, mask;
 	int bit;
 #if DES_BS_VECTOR
@@ -437,7 +484,7 @@ int DES_bs_cmp_all(ARCH_WORD *binary)
 
 	for_each_depth() {
 		value = binary[0];
-		b = (DES_bs_vector *)&DES_bs_all.B[0] DEPTH;
+		b = (DES_bs_vector *)&tp->B[0] DEPTH;
 
 		mask = b[0] START ^ -(value & 1);
 		mask |= b[1] START ^ -((value >> 1) & 1);
@@ -458,12 +505,22 @@ int DES_bs_cmp_all(ARCH_WORD *binary)
 			b += 2;
 		}
 
+#ifdef CMP_ALL_PARA
+		retval = 1;
+		continue; /* have to let the rest of threads run anyway... */
+#else
 		return 1;
+#endif
 next_depth:
 		;
 	}
+}
 
+#ifdef CMP_ALL_PARA
+	return retval;
+#else
 	return 0;
+#endif
 }
 
 int DES_bs_cmp_one(ARCH_WORD *binary, int count, int index)
@@ -472,7 +529,7 @@ int DES_bs_cmp_one(ARCH_WORD *binary, in
 	DES_bs_vector *b;
 
 	init_depth();
-	b = (DES_bs_vector *)&DES_bs_all.B[0] DEPTH;
+	b = (DES_bs_vector *)&tp->B[0] DEPTH;
 
 	for (bit = 0; bit < 31; bit++, b++)
 		if (((b[0] START >> index) ^ (binary[0] >> bit)) & 1) return 0;
diff -urp john-1.7.6.orig/src/DES_bs.h john-1.7.6-omp-des-2/src/DES_bs.h
--- john-1.7.6.orig/src/DES_bs.h	2010-06-09 12:24:14 +0000
+++ john-1.7.6-omp-des-2/src/DES_bs.h	2010-06-27 17:13:56 +0000
@@ -69,9 +69,27 @@ typedef struct {
 	int KS_updates;		/* Key schedule updates counter */
 	int keys_changed;	/* If keys have changed since last expand */
 	unsigned char keys[DES_BS_DEPTH][8];	/* Current keys */
+	int gap[2]; /* XXX: SSE2 alignment hack for DES_bs_mt > 1 */
 } DES_bs_combined;
 
-extern DES_bs_combined DES_bs_all;
+#if DES_BS_ASM
+#define DES_bs_mt			1
+#else
+#define DES_bs_mt			8
+#endif
+
+extern DES_bs_combined DES_bs_all[DES_bs_mt];
+
+#define for_each_t() \
+	for (tp = &DES_bs_all[0]; tp <= &DES_bs_all[DES_bs_mt - 1]; tp++)
+
+#if DES_BS_ASM
+#define DES_BS_EXPAND_MERGED 0
+#else
+/* optional, can set to 0 to have expand in DES_bs.c
+ * or to 1 to have it merged into "crypt body" in DES_bs_b.c */
+#define DES_BS_EXPAND_MERGED 0
+#endif
 
 /*
  * Initializes the internal structures.
@@ -99,7 +117,7 @@ extern void DES_bs_set_key(char *key, in
 /*
  * Initializes the key schedule with actual key bits. Not for LM.
  */
-#if DES_BS_EXPAND
+#if DES_BS_EXPAND && !DES_BS_EXPAND_MERGED
 extern void DES_bs_expand_keys(void);
 #else
 #define DES_bs_expand_keys()
diff -urp john-1.7.6.orig/src/DES_bs_b.c john-1.7.6-omp-des-2/src/DES_bs_b.c
--- john-1.7.6.orig/src/DES_bs_b.c	2010-06-13 18:17:29 +0000
+++ john-1.7.6-omp-des-2/src/DES_bs_b.c	2010-06-27 17:07:24 +0000
@@ -8,6 +8,8 @@
 #if !DES_BS_ASM
 #include "DES_bs.h"
 
+#define _ones ((vtype *)DES_bs_all[0].ones)
+
 #if defined(__ALTIVEC__) && DES_BS_DEPTH == 128
 #undef DES_BS_VECTOR
 
@@ -145,7 +147,7 @@ typedef __m128i vtype;
 	_mm_xor_si128((a), (b))
 
 #define vnot(dst, a) \
-	(dst) = _mm_xor_si128((a), *(vtype *)&DES_bs_all.ones)
+	(dst) = _mm_xor_si128((a), *_ones)
 #define vand(dst, a, b) \
 	(dst) = _mm_and_si128((a), (b))
 #define vor(dst, a, b) \
@@ -153,8 +155,7 @@ typedef __m128i vtype;
 #define vandn(dst, a, b) \
 	(dst) = _mm_andnot_si128((b), (a))
 #define vxorn(dst, a, b) \
-	(dst) = _mm_xor_si128(_mm_xor_si128((a), (b)), \
-	    *(vtype *)&DES_bs_all.ones)
+	(dst) = _mm_xor_si128(_mm_xor_si128((a), (b)), *_ones)
 
 #elif defined(__SSE2__) && defined(__MMX__) && DES_BS_DEPTH == 192 && \
     !defined(DES_BS_NO_MMX)
@@ -178,8 +179,8 @@ typedef struct {
 	(dst).g = _mm_xor_si64((a).g, (b).g)
 
 #define vnot(dst, a) \
-	(dst).f = _mm_xor_si128((a).f, ((vtype *)&DES_bs_all.ones)->f); \
-	(dst).g = _mm_xor_si64((a).g, ((vtype *)&DES_bs_all.ones)->g)
+	(dst).f = _mm_xor_si128((a).f, _ones->f); \
+	(dst).g = _mm_xor_si64((a).g, _ones->g)
 #define vand(dst, a, b) \
 	(dst).f = _mm_and_si128((a).f, (b).f); \
 	(dst).g = _mm_and_si64((a).g, (b).g)
@@ -190,10 +191,8 @@ typedef struct {
 	(dst).f = _mm_andnot_si128((b).f, (a).f); \
 	(dst).g = _mm_andnot_si64((b).g, (a).g)
 #define vxorn(dst, a, b) \
-	(dst).f = _mm_xor_si128(_mm_xor_si128((a).f, (b).f), \
-	    (*(vtype *)&DES_bs_all.ones).f); \
-	(dst).g = _mm_xor_si64(_mm_xor_si64((a).g, (b).g), \
-	    (*(vtype *)&DES_bs_all.ones).g);
+	(dst).f = _mm_xor_si128(_mm_xor_si128((a).f, (b).f), _ones->f); \
+	(dst).g = _mm_xor_si64(_mm_xor_si64((a).g, (b).g), _ones->g);
 
 #elif defined(__SSE2__) && \
     ((ARCH_BITS == 64 && DES_BS_DEPTH == 192) || \
@@ -217,7 +216,7 @@ typedef struct {
 	(dst).g = (a).g ^ (b).g
 
 #define vnot(dst, a) \
-	(dst).f = _mm_xor_si128((a).f, ((vtype *)&DES_bs_all.ones)->f); \
+	(dst).f = _mm_xor_si128((a).f, _ones->f); \
 	(dst).g = ~(a).g
 #define vand(dst, a, b) \
 	(dst).f = _mm_and_si128((a).f, (b).f); \
@@ -229,8 +228,7 @@ typedef struct {
 	(dst).f = _mm_andnot_si128((b).f, (a).f); \
 	(dst).g = (a).g & ~(b).g
 #define vxorn(dst, a, b) \
-	(dst).f = _mm_xor_si128(_mm_xor_si128((a).f, (b).f), \
-	    (*(vtype *)&DES_bs_all.ones).f); \
+	(dst).f = _mm_xor_si128(_mm_xor_si128((a).f, (b).f), _ones->f); \
 	(dst).g = ~((a).g ^ (b).g)
 
 #elif defined(__SSE2__) && defined(__MMX__) && \
@@ -259,8 +257,8 @@ typedef struct {
 	(dst).h = (a).h ^ (b).h
 
 #define vnot(dst, a) \
-	(dst).f = _mm_xor_si128((a).f, ((vtype *)&DES_bs_all.ones)->f); \
-	(dst).g = _mm_xor_si64((a).g, ((vtype *)&DES_bs_all.ones)->g); \
+	(dst).f = _mm_xor_si128((a).f, _ones->f); \
+	(dst).g = _mm_xor_si64((a).g, _ones->g); \
 	(dst).h = ~(a).h
 #define vand(dst, a, b) \
 	(dst).f = _mm_and_si128((a).f, (b).f); \
@@ -275,10 +273,8 @@ typedef struct {
 	(dst).g = _mm_andnot_si64((b).g, (a).g); \
 	(dst).h = (a).h & ~(b).h
 #define vxorn(dst, a, b) \
-	(dst).f = _mm_xor_si128(_mm_xor_si128((a).f, (b).f), \
-	    (*(vtype *)&DES_bs_all.ones).f); \
-	(dst).g = _mm_xor_si64(_mm_xor_si64((a).g, (b).g), \
-	    (*(vtype *)&DES_bs_all.ones).g); \
+	(dst).f = _mm_xor_si128(_mm_xor_si128((a).f, (b).f), _ones->f); \
+	(dst).g = _mm_xor_si64(_mm_xor_si64((a).g, (b).g), _ones->g); \
 	(dst).h = ~((a).h ^ (b).h)
 
 #elif defined(__MMX__) && ARCH_BITS != 64 && DES_BS_DEPTH == 64
@@ -292,7 +288,7 @@ typedef __m64 vtype;
 	_mm_xor_si64((a), (b))
 
 #define vnot(dst, a) \
-	(dst) = _mm_xor_si64((a), *(vtype *)&DES_bs_all.ones)
+	(dst) = _mm_xor_si64((a), *_ones)
 #define vand(dst, a, b) \
 	(dst) = _mm_and_si64((a), (b))
 #define vor(dst, a, b) \
@@ -300,8 +296,7 @@ typedef __m64 vtype;
 #define vandn(dst, a, b) \
 	(dst) = _mm_andnot_si64((b), (a))
 #define vxorn(dst, a, b) \
-	(dst) = _mm_xor_si64(_mm_xor_si64((a), (b)), \
-	    *(vtype *)&DES_bs_all.ones)
+	(dst) = _mm_xor_si64(_mm_xor_si64((a), (b)), *_ones)
 
 #elif defined(__MMX__) && ARCH_BITS == 32 && DES_BS_DEPTH == 96
 #undef DES_BS_VECTOR
@@ -322,7 +317,7 @@ typedef struct {
 	(dst).g = (a).g ^ (b).g
 
 #define vnot(dst, a) \
-	(dst).f = _mm_xor_si64((a).f, ((vtype *)&DES_bs_all.ones)->f); \
+	(dst).f = _mm_xor_si64((a).f, _ones->f); \
 	(dst).g = ~(a).g
 #define vand(dst, a, b) \
 	(dst).f = _mm_and_si64((a).f, (b).f); \
@@ -334,8 +329,7 @@ typedef struct {
 	(dst).f = _mm_andnot_si64((b).f, (a).f); \
 	(dst).g = (a).g & ~(b).g
 #define vxorn(dst, a, b) \
-	(dst).f = _mm_xor_si64(_mm_xor_si64((a).f, (b).f), \
-	    (*(vtype *)&DES_bs_all.ones).f); \
+	(dst).f = _mm_xor_si64(_mm_xor_si64((a).f, (b).f), _ones->f); \
 	(dst).g = ~((a).g ^ (b).g)
 
 #else
@@ -398,8 +392,8 @@ typedef ARCH_WORD vtype;
 #include "nonstd.c"
 #endif
 
-#define b				DES_bs_all.B
-#define e				DES_bs_all.E.E
+#define b				tp->B
+#define e				tp->E.E
 
 #ifndef DES_BS_VECTOR
 #define DES_BS_VECTOR			0
@@ -456,12 +450,32 @@ typedef ARCH_WORD vtype;
 		vst(b[i] bd, 7, v7); \
 	}
 
+#if DES_BS_EXPAND_MERGED
+#undef EK_FUNC
+// #define EK_FUNC
+
+#ifdef EK_FUNC
+static void expand_keys(DES_bs_combined *tp)
+{
+	int index;
+	for (index = 0; index < 0x300; index++)
+	for_each_depth()
+		vst(tp->KS.v[index] kd, 0, *(vtype *)tp->KSp[index] kd);
+	tp->keys_changed = 0;
+}
+#endif
+#endif
+
 #define x(p) vxorf(*(vtype *)&e[p] ed, *(vtype *)&k[p] kd)
 #define y(p, q) vxorf(*(vtype *)&b[p] bd, *(vtype *)&k[q] kd)
 #define z(r) ((vtype *)&b[r] bd)
 
 void DES_bs_crypt(int count)
 {
+	DES_bs_combined *tp;
+
+#pragma omp parallel for default(shared) private(tp)
+for_each_t() {
 #if DES_BS_EXPAND
 	DES_bs_vector *k;
 #else
@@ -471,9 +485,21 @@ void DES_bs_crypt(int count)
 #if DES_BS_VECTOR
 	int depth;
 #endif
-
 #ifndef zero
 	vtype zero;
+#endif
+
+#if DES_BS_EXPAND && DES_BS_EXPAND_MERGED
+	if (tp->keys_changed)
+#ifdef EK_FUNC
+		expand_keys(tp);
+#else
+		goto expand_keys;
+back_from_expand_keys:
+#endif
+#endif
+
+#ifndef zero
 /* This may produce an "uninitialized" warning */
 	vxor(zero, zero, zero);
 #endif
@@ -481,9 +507,9 @@ void DES_bs_crypt(int count)
 	DES_bs_clear_block();
 
 #if DES_BS_EXPAND
-	k = DES_bs_all.KS.v;
+	k = tp->KS.v;
 #else
-	k = DES_bs_all.KS.p;
+	k = tp->KS.p;
 #endif
 	rounds_and_swapped = 8;
 	iterations = count;
@@ -548,16 +574,37 @@ swap:
 	k -= (0x300 + 48);
 	rounds_and_swapped = 0x108;
 	if (--iterations) goto swap;
-	return;
+	continue;
 
 next:
 	k -= (0x300 - 48);
 	rounds_and_swapped = 8;
 	if (--iterations) goto start;
+
+#if DES_BS_EXPAND && DES_BS_EXPAND_MERGED
+#ifndef EK_FUNC
+	continue;
+
+expand_keys:
+	{
+		int index;
+		for (index = 0; index < 0x300; index++)
+		for_each_depth()
+			vst(tp->KS.v[index] kd, 0, *(vtype *)tp->KSp[index] kd);
+	}
+	tp->keys_changed = 0;
+	goto back_from_expand_keys;
+#endif
+#endif
+}
 }
 
 void DES_bs_crypt_25(void)
 {
+	DES_bs_combined *tp;
+
+#pragma omp parallel for default(shared) private(tp)
+for_each_t() {
 #if DES_BS_EXPAND
 	DES_bs_vector *k;
 #else
@@ -567,9 +614,21 @@ void DES_bs_crypt_25(void)
 #if DES_BS_VECTOR
 	int depth;
 #endif
-
 #ifndef zero
 	vtype zero;
+#endif
+
+#if DES_BS_EXPAND && DES_BS_EXPAND_MERGED
+	if (tp->keys_changed)
+#ifdef EK_FUNC
+		expand_keys(tp);
+#else
+		goto expand_keys;
+back_from_expand_keys:
+#endif
+#endif
+
+#ifndef zero
 /* This may produce an "uninitialized" warning */
 	vxor(zero, zero, zero);
 #endif
@@ -577,9 +636,9 @@ void DES_bs_crypt_25(void)
 	DES_bs_clear_block();
 
 #if DES_BS_EXPAND
-	k = DES_bs_all.KS.v;
+	k = tp->KS.v;
 #else
-	k = DES_bs_all.KS.p;
+	k = tp->KS.p;
 #endif
 	rounds_and_swapped = 8;
 	iterations = 25;
@@ -652,13 +711,28 @@ swap:
 	k -= (0x300 + 48);
 	rounds_and_swapped = 0x108;
 	if (--iterations) goto swap;
-	return;
+	continue;
 
 next:
 	k -= (0x300 - 48);
 	rounds_and_swapped = 8;
 	iterations--;
 	goto start;
+
+#if DES_BS_EXPAND && DES_BS_EXPAND_MERGED
+#ifndef EK_FUNC
+expand_keys:
+	{
+		int index;
+		for (index = 0; index < 0x300; index++)
+		for_each_depth()
+			vst(tp->KS.v[index] kd, 0, *(vtype *)tp->KSp[index] kd);
+	}
+	tp->keys_changed = 0;
+	goto back_from_expand_keys;
+#endif
+#endif
+}
 }
 
 #undef x
@@ -672,11 +746,7 @@ next:
 
 void DES_bs_crypt_LM(void)
 {
-	ARCH_WORD **k;
-	int rounds;
-#if DES_BS_VECTOR
-	int depth;
-#endif
+	DES_bs_combined *tp;
 
 #ifndef zero
 	vtype zero, ones;
@@ -685,6 +755,14 @@ void DES_bs_crypt_LM(void)
 	vnot(ones, zero);
 #endif
 
+#pragma omp parallel for default(shared) private(tp)
+for_each_t() {
+	ARCH_WORD **k;
+	int rounds;
+#if DES_BS_VECTOR
+	int depth;
+#endif
+
 	DES_bs_set_block_8(0, zero, zero, zero, zero, zero, zero, zero, zero);
 	DES_bs_set_block_8(8, ones, ones, ones, zero, ones, zero, zero, zero);
 	DES_bs_set_block_8(16, zero, zero, zero, zero, zero, zero, zero, ones);
@@ -694,7 +772,7 @@ void DES_bs_crypt_LM(void)
 	DES_bs_set_block_8(48, ones, ones, zero, zero, zero, zero, ones, zero);
 	DES_bs_set_block_8(56, ones, zero, ones, zero, ones, ones, ones, ones);
 
-	k = DES_bs_all.KS.p;
+	k = tp->KS.p;
 	rounds = 8;
 
 	do {
@@ -767,4 +845,5 @@ void DES_bs_crypt_LM(void)
 		k += 96;
 	} while (--rounds);
 }
+}
 #endif
diff -urp john-1.7.6.orig/src/DES_fmt.c john-1.7.6-omp-des-2/src/DES_fmt.c
--- john-1.7.6.orig/src/DES_fmt.c	2010-01-16 17:05:46 +0000
+++ john-1.7.6-omp-des-2/src/DES_fmt.c	2010-06-27 16:27:52 +0000
@@ -38,8 +38,8 @@ static struct fmt_tests tests[] = {
 #define BINARY_SIZE			ARCH_SIZE
 #define SALT_SIZE			ARCH_SIZE
 
-#define MIN_KEYS_PER_CRYPT		DES_BS_DEPTH
-#define MAX_KEYS_PER_CRYPT		DES_BS_DEPTH
+#define MIN_KEYS_PER_CRYPT		(DES_BS_DEPTH * DES_bs_mt)
+#define MAX_KEYS_PER_CRYPT		(DES_BS_DEPTH * DES_bs_mt)
 
 #else
 
@@ -323,7 +323,8 @@ static char *get_key(int index)
 	static char out[PLAINTEXT_LENGTH + 1];
 
 #if DES_BS
-	memcpy(out, DES_bs_all.keys[index], PLAINTEXT_LENGTH);
+	int t = index / DES_BS_DEPTH; index %= DES_BS_DEPTH;
+	memcpy(out, DES_bs_all[t].keys[index], PLAINTEXT_LENGTH);
 #else
 	memcpy(out, buffer[index].key, PLAINTEXT_LENGTH);
 #endif
diff -urp john-1.7.6.orig/src/LM_fmt.c john-1.7.6-omp-des-2/src/LM_fmt.c
--- john-1.7.6.orig/src/LM_fmt.c	2010-01-16 17:16:20 +0000
+++ john-1.7.6-omp-des-2/src/LM_fmt.c	2010-06-26 21:43:01 +0000
@@ -41,8 +41,8 @@ static struct fmt_tests tests[] = {
 #define BINARY_SIZE			ARCH_SIZE
 #define SALT_SIZE			0
 
-#define MIN_KEYS_PER_CRYPT		DES_BS_DEPTH
-#define MAX_KEYS_PER_CRYPT		DES_BS_DEPTH
+#define MIN_KEYS_PER_CRYPT		(DES_BS_DEPTH * DES_bs_mt)
+#define MAX_KEYS_PER_CRYPT		(DES_BS_DEPTH * DES_bs_mt)
 
 static void init(void)
 {
@@ -175,9 +175,9 @@ static int cmp_exact(char *source, int i
 static char *get_key(int index)
 {
 #if !DES_BS_VECTOR && ARCH_BITS >= 64
-	return (char *)DES_bs_all.E.extras.keys[index];
+	return (char *)DES_bs_all[index / DES_BS_DEPTH].E.extras.keys[index % DES_BS_DEPTH];
 #else
-	return (char *)DES_bs_all.keys[index];
+	return (char *)DES_bs_all[index / DES_BS_DEPTH].keys[index % DES_BS_DEPTH];
 #endif
 }
 
diff -urp john-1.7.6.orig/src/Makefile john-1.7.6-omp-des-2/src/Makefile
--- john-1.7.6.orig/src/Makefile	2010-06-13 21:12:37 +0000
+++ john-1.7.6-omp-des-2/src/Makefile	2010-06-26 23:51:52 +0000
@@ -16,7 +16,7 @@ NULL = /dev/null
 CPPFLAGS = -E
 OMPFLAGS =
 # gcc with OpenMP
-#OMPFLAGS = -fopenmp
+OMPFLAGS = -fopenmp
 # Sun Studio with OpenMP (set the OMP_NUM_THREADS env var at runtime)
 #OMPFLAGS = -xopenmp
 CFLAGS = -c -Wall -O2 -fomit-frame-pointer $(OMPFLAGS)
diff -urp john-1.7.6.orig/src/params.h john-1.7.6-omp-des-2/src/params.h
--- john-1.7.6.orig/src/params.h	2010-06-14 02:38:55 +0000
+++ john-1.7.6-omp-des-2/src/params.h	2010-06-27 15:01:28 +0000
@@ -15,7 +15,7 @@
 /*
  * John's version number.
  */
-#define JOHN_VERSION			"1.7.6"
+#define JOHN_VERSION			"1.7.6-des-2"
 
 /*
  * Notes to packagers of John for *BSD "ports", Linux distributions, etc.:
diff -urp john-1.7.6.orig/src/x86-64.h john-1.7.6-omp-des-2/src/x86-64.h
--- john-1.7.6.orig/src/x86-64.h	2010-06-13 00:33:38 +0000
+++ john-1.7.6-omp-des-2/src/x86-64.h	2010-06-27 15:22:39 +0000
@@ -33,7 +33,8 @@
 #define DES_EXTB			1
 #define DES_COPY			0
 #if defined(__SSE2__) && \
-    ((__GNUC__ == 4 && __GNUC_MINOR__ >= 4) || __GNUC__ > 4)
+    (((__GNUC__ == 4 && __GNUC_MINOR__ >= 4) || __GNUC__ > 4) || \
+    defined(_OPENMP))
 #define DES_BS_ASM			0
 #if 1
 #define DES_BS_VECTOR			2

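For readers who want the gist of the change without walking through every hunk, here is a minimal sketch of the two ideas the patch relies on: a flat key index is split into a per-thread block number and a slot within that block (as in the get_key() changes above), and the crypt routine becomes an OpenMP loop over the per-thread blocks, with "continue" where the serial code had "return". Everything in it is an assumption made for illustration: DEPTH and MT stand in for DES_BS_DEPTH and DES_bs_mt, ctx_t is a much simplified stand-in for DES_bs_combined, and the toy hash takes the place of the actual bitslice DES rounds.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-ins for the patch's DES_BS_DEPTH and DES_bs_mt. */
#define DEPTH	128	/* keys per bitslice block */
#define MT	4	/* number of per-thread blocks */
#define KEY_LEN	8

/* Simplified per-thread context; the real DES_bs_combined holds far more. */
typedef struct {
	char keys[DEPTH][KEY_LEN];
	int keys_changed;
	unsigned int result[DEPTH];
} ctx_t;

static ctx_t all[MT];

/* Store a key: the flat index selects a block (t) and a slot inside it. */
void set_key(int index, const char *key)
{
	int t, slot;

	assert(index >= 0 && index < DEPTH * MT);
	t = index / DEPTH;
	slot = index % DEPTH;
	strncpy(all[t].keys[slot], key, KEY_LEN); /* zero-pads short keys */
	all[t].keys_changed = 1;
}

/* Read a key back using the same index split. */
const char *get_key(int index)
{
	static char out[KEY_LEN + 1];
	int t = index / DEPTH;
	int slot = index % DEPTH;

	memcpy(out, all[t].keys[slot], KEY_LEN);
	out[KEY_LEN] = '\0';
	return out;
}

/* Toy "crypt": each loop iteration processes one whole block. */
void crypt_all(void)
{
	int t;

#pragma omp parallel for default(shared)
	for (t = 0; t < MT; t++) {
		ctx_t *tp = &all[t];	/* declared in the loop body, so private */
		int i, j;

		/* "continue" rather than "return": branching out of an
		 * OpenMP loop body is not allowed. */
		if (!tp->keys_changed)
			continue;

		for (i = 0; i < DEPTH; i++) {
			unsigned int h = 0;	/* placeholder, not DES */
			for (j = 0; j < KEY_LEN; j++)
				h = h * 31 + (unsigned char)tp->keys[i][j];
			tp->result[i] = h;
		}
		tp->keys_changed = 0;
	}
}
```

Built with "gcc -fopenmp", the blocks are processed by separate threads; without that flag the pragma is ignored and the same loop runs serially. Since each thread only touches its own ctx_t, no locking is needed in this sketch.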