john-users - Re: fast freebsd MD5 implementation

Message-ID: <20081016230214.GA16408@openwall.com>
Date: Fri, 17 Oct 2008 03:02:14 +0400
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Subject: Re: fast freebsd MD5 implementation

On Wed, Oct 15, 2008 at 03:10:49PM +0200, Simon Marechal wrote:
> After exchanging mails with the author of http://3.14.by/en/md5, I toyed 
> with the idea of using similar code for freebsd MD5. So here is a patch 
> that uses a similar implementation, but for the whole freebsd MD5 
> crypt().

Nice stuff, but very dirty as you know. ;-)

> It is set to use ICC, so you might want to download it (it 
> speeds everything up anyway) to test, but it should be possible to have 
> it run on gcc (don't forget -march=nocona). The patch is against 
> john-1.7.3.1-all, not vanilla! (it should work with vanilla anyway)

The patch applies cleanly against john-1.7.3.1-all-5; a Makefile hunk
won't apply to anything older, but the changes to Makefile are mostly
specific to Intel's compiler anyway.  What I did for my testing was to
not use the Makefile patch, but to add sse-intrinsics.o (not the .c as
the patch does!) to JOHN_OBJS_MINIMAL manually.

I used gcc 4.3.1 on linux-x86-64 to compile/test this.  In
sse-intrinsics.c, I had to change <pmmintrin.h> to <emmintrin.h> (the
former requires SSE3, which is not needed here).  I also had to add
"#define MMX_COEF 4" directly to that file because that definition does
not come from arch.h on x86-64.  Then it compiled (with lots of
warnings), and even the self-test passed.
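
For illustration, the two source changes described above could look roughly
like this in a standalone C file.  This is only a sketch, not the patch
itself: the #ifndef guard and the add_lanes() helper are hypothetical, and
the scalar fallback exists only so the file also builds on targets without
SSE2.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 header; <pmmintrin.h> is the SSE3 one */
#endif

/* arch.h does not provide this on x86-64, so define it locally.
 * The #ifndef guard is an assumption, not something the patch has. */
#ifndef MMX_COEF
#define MMX_COEF 4      /* 32-bit lanes per 128-bit SSE2 register */
#endif

/* Hypothetical helper: add MMX_COEF 32-bit lanes in one step. */
void add_lanes(const uint32_t *a, const uint32_t *b, uint32_t *out)
{
    assert(a && b && out);
#if defined(__SSE2__)
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi32(va, vb));
#else
    /* Scalar fallback with the same lane-wise, wrapping semantics. */
    for (int i = 0; i < MMX_COEF; i++)
        out[i] = a[i] + b[i];
#endif
}
```

Only SSE2 intrinsics are used here, which is why <emmintrin.h> is enough.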

> Bench with standard code on my laptop: 3258k/s
> With this code: 10433k/s

What CPU is that?

I am getting:

Athlon 64 3000+ 2.0 GHz:
old code: 8970 c/s
new code: 9200 c/s

Q6600 2.4 GHz (one core):
old code: 10200 c/s
new code: 24000 c/s

I used the same "john" binaries (built with gcc 4.3.1) on both machines.

I think that the speedup in your case is bigger primarily because you
were benchmarking "32-bit" code, whereas on x86-64 the existing
implementation ("old code") computes two hashes at a time, which makes
it around 50% faster.
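
To put the numbers side by side, the speedup factors can be computed with a
trivial helper.  The function name speedup() is hypothetical; the inputs are
just the figures quoted in this thread.

```c
#include <assert.h>

/* Speedup factor of the new code over the old code, given two rates
 * measured in the same unit. */
double speedup(double old_rate, double new_rate)
{
    assert(old_rate > 0.0);
    return new_rate / old_rate;
}
```

With the figures above, speedup(3258, 10433) is about 3.2,
speedup(10200, 24000) is about 2.35, and speedup(8970, 9200) is about 1.03.
If the existing x86-64 code really is around 50% faster than a 32-bit
build, the Q6600 ratio against a 32-bit baseline would be roughly
2.35 * 1.5, or about 3.5, which is in the same range as the 3.2 reported.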

Then I went to test this "for real", on a dummy password file with 120
hashes and a wordlist with the corresponding passwords.  JtR with this
patch applied cracked exactly 40 out of the 120 passwords (one third),
so clearly it does not work right.  Perhaps only one of the three sets
of SSE vectors
is being dealt with correctly.  Perhaps we need to make the self-test
more thorough, and have more test vectors in MD5_fmt.c (12 or more?)
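
The guess above can be checked against a toy model.  This is a sketch of one
possible explanation, not an analysis of the patch: it assumes candidates are
loaded 12 at a time (3 sets of 4 lanes) in order, and that only the first
working_sets sets produce correct hashes.  All names here are hypothetical.

```c
#include <assert.h>

#define LANES 4                 /* 32-bit lanes per SSE2 vector */
#define SETS  3                 /* sets of vectors handled in parallel */
#define BATCH (LANES * SETS)    /* candidates hashed per batch */

/* Count how many of n_candidates would be cracked if only the first
 * working_sets sets of vectors were computed correctly (assumed model). */
int count_cracked(int n_candidates, int working_sets)
{
    assert(n_candidates >= 0);
    assert(working_sets >= 0 && working_sets <= SETS);
    int cracked = 0;
    for (int i = 0; i < n_candidates; i++) {
        int slot = i % BATCH;   /* position within its batch */
        int set = slot / LANES; /* set of vectors holding that slot */
        if (set < working_sets)
            cracked++;
    }
    return cracked;
}

/* Number of test vectors needed to fill every slot of one batch. */
int min_test_vectors(void)
{
    return BATCH;
}
```

Under these assumptions count_cracked(120, 1) gives 40, matching the result
above, and min_test_vectors() gives 12: a self-test with fewer vectors than
that could leave the second and third sets unexercised.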

> PS: next stop, GPU implementation. I'm getting tired of not cracking 
> enough of these hashes.

This must be fun, but perhaps it'd be better to submit a cleaner patch
(and one that actually works) for inclusion into the jumbo patch first. ;-)

Besides the issues mentioned above, I'd like the number of sets of SSE
vectors you deal with "in parallel" made into a separate #define.  Right
now, you have this number, which is 3, hard-coded into too many places.
It is possible that different numbers will be optimal for some CPUs,
some compilers, or for 32- vs. 64-bit mode vs. AltiVec (yes, by using
C intrinsics it should be possible to have the same source file work for
both SSE2 and AltiVec).
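
A minimal sketch of what I mean, assuming a macro named SSE_PARA (a
hypothetical name, not something in the patch) and a made-up add-and-rotate
step standing in for the real MD5 rounds.  The scalar branch is only a
portability fallback, not an AltiVec implementation.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>
#endif

#define MMX_COEF 4                       /* lanes per vector */
#ifndef SSE_PARA
#define SSE_PARA 3                       /* sets of vectors; tunable */
#endif
#define NB_LANES (MMX_COEF * SSE_PARA)   /* candidates per call */

/* Hypothetical step: out = rotl32(a + b, s) over all SSE_PARA sets.
 * Each array holds NB_LANES elements; s must be in 1..31. */
void add_rotl_all(const uint32_t *a, const uint32_t *b, uint32_t *out, int s)
{
    assert(s > 0 && s < 32);
    for (int i = 0; i < SSE_PARA; i++) {
        const int off = i * MMX_COEF;
#if defined(__SSE2__)
        __m128i t = _mm_add_epi32(
            _mm_loadu_si128((const __m128i *)(a + off)),
            _mm_loadu_si128((const __m128i *)(b + off)));
        t = _mm_or_si128(
            _mm_sll_epi32(t, _mm_cvtsi32_si128(s)),
            _mm_srl_epi32(t, _mm_cvtsi32_si128(32 - s)));
        _mm_storeu_si128((__m128i *)(out + off), t);
#else
        for (int j = 0; j < MMX_COEF; j++) {
            uint32_t t = a[off + j] + b[off + j];
            out[off + j] = (t << s) | (t >> (32 - s));
        }
#endif
    }
}
```

With the count in one place, trying 2 or 4 sets on a given CPU or compiler
becomes a matter of building with a different -DSSE_PARA=N.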

Thanks,

Alexander
