Date: Sat, 18 Feb 2006 01:40:02 +0300
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Subject: Re: DIGEST-MD5, dominosec optimization

On Fri, Feb 17, 2006 at 11:14:01PM +0100, Michal Luczaj wrote:
> Sure. I was using MOVAPS, not MOVUPS, and I did take care of that.
[...]
> I had no problems with stack alignment. And I'm pretty sure it was
> 16-byte aligned, otherwise I would get a segfault.

I hope so - unless your CPU or OS actually supports unaligned accesses
even with MOVAPS.

> From what I recall I only changed ARCH_ALLOWS_UNALIGNED to 0

This was not needed.  It would only slow down a few things.  x86 _can_
do unaligned 32-bit accesses with penalties that are not too bad, so
ARCH_ALLOWS_UNALIGNED should be set to 1.  Your use of SSE in a few
places does not change that.

> and set -mpreferred-stack-boundary=4.

That's the default with recent gcc.

> And maybe some __attribute__ ((aligned (16)))...? I just don't remember now.

Yes, you would need that attribute - or you would need to use a data
type that is this large anyway.
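To illustrate the attribute mentioned above, here is a minimal sketch assuming gcc (or a compatible compiler). The buffer names are borrowed from the quoted loop further down, and is_aligned16() is a hypothetical helper used only to check the result; none of this is taken from the actual patch.

```c
#include <stdint.h>

/* Hypothetical buffers, named after those in the quoted loop.
 * The gcc attribute requests 16-byte alignment, which MOVAPS requires. */
static unsigned char state[16] __attribute__ ((aligned (16)));
static unsigned char block[16] __attribute__ ((aligned (16)));
static unsigned char x[16] __attribute__ ((aligned (16)));

/* Hypothetical helper: returns nonzero if p is 16-byte aligned. */
static int is_aligned16(const void *p)
{
	return ((uintptr_t)p & 15) == 0;
}
```

A check like is_aligned16(state) in a debug build would be one way to confirm the alignment rather than relying on the absence of a segfault.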

> Let me explain: I had no SSE specific problems (segfaults and all that
> memory related stuff). Everything worked fine. I just wonder why it
> turned out to be sooo slooow.

I am not sure either.  I would suspect unaligned accesses.

> > As it relates to the XOR'ing, you definitely want to apply it to 
> > quantities larger than "char" whenever possible.  (But this does not
> > appear to be possible in your inner loop.)
> 
> There is something like this:
> 
> 	for (i = 0; i < 16; ++i)
> 		x[i] = state[i] ^ block[i];

Yes, but it's not your inner loop.
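For reference, a portable sketch of doing that 16-byte XOR in quantities larger than "char" (four 32-bit words here) might look like the following. The function name xor16_words() is hypothetical, and the memcpy() calls are only there to keep the sketch free of alignment and aliasing assumptions; with buffers known to be suitably aligned and typed, the copies would not be needed.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: XOR two 16-byte blocks as four 32-bit words
 * instead of sixteen individual bytes. */
static void xor16_words(unsigned char *x,
    const unsigned char *state, const unsigned char *block)
{
	uint32_t s[4], b[4];
	int i;

	/* Copy into word arrays so that no unaligned or type-punned
	 * accesses are performed on the caller's buffers. */
	memcpy(s, state, sizeof(s));
	memcpy(b, block, sizeof(b));

	for (i = 0; i < 4; i++)
		s[i] ^= b[i];

	memcpy(x, s, sizeof(s));
}
```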

> And I switched it to:
> 
> 	_mm_store_ps(
> 		(float*)x,
> 		_mm_xor_ps(
> 			_mm_load_ps((float*)state),
> 			_mm_load_ps((float*)block)
> 			)
> 		);
> 
> Maybe this piece of code is just not important enough...

Exactly.

> But still, even
> if it doesn't give a boost, why-oh-why does it slow the whole thing down?

Well, another guess would be that gcc generates code different from what
you would expect.  Did you review the assembly output ("gcc -S ...")?

Anyway, this has little to do with John the Ripper and with optimizing
your patches to JtR.  If you want to achieve a significant speedup, you
should concentrate on trying multiple passwords in parallel and taking
advantage of that in your inner loop.  For dominosec, it does not appear
to be possible to reasonably use SSE there - so don't.
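As a rough illustration of what trying multiple passwords in parallel in an inner loop can mean without SSE, consider the sketch below. It is a toy, NOT the real dominosec algorithm and not code from JtR: the mixing step, the table, and the names mix_one() and mix_two() are all assumptions made up for this example. The point is only that two independent candidates can be interleaved in one loop, so that one candidate's serially dependent table lookup can overlap with the other's.

```c
#include <stddef.h>

/* Toy serial mixing step (hypothetical, not dominosec): each byte
 * depends on the result of the previous table lookup. */
static void mix_one(unsigned char *x, size_t len,
    const unsigned char table[256])
{
	unsigned char p = 0;
	size_t k;

	for (k = 0; k < len; k++) {
		x[k] ^= table[p];
		p = x[k];
	}
}

/* The same step applied to two independent candidates at once.
 * The two dependency chains (p0 and p1) do not interact, so the CPU
 * is free to overlap them. */
static void mix_two(unsigned char *x0, unsigned char *x1, size_t len,
    const unsigned char table[256])
{
	unsigned char p0 = 0, p1 = 0;
	size_t k;

	for (k = 0; k < len; k++) {
		x0[k] ^= table[p0];
		p0 = x0[k];
		x1[k] ^= table[p1];
		p1 = x1[k];
	}
}
```

Whether this kind of interleaving actually helps depends on the CPU and on what the compiler generates, so it would need to be benchmarked.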

-- 
Alexander Peslyak <solar at openwall.com>
GPG key ID: B35D3598  fp: 6429 0D7E F130 C13E C929  6447 73C3 A290 B35D 3598
http://www.openwall.com - bringing security into open computing environments

Was I helpful?  Please give your feedback here: http://rate.affero.net/solar
