Date: Tue, 8 Sep 2015 17:18:51 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: md5crypt mmxput*()

On Tue, Sep 08, 2015 at 01:17:14PM +0300, Solar Designer wrote:
> Benchmarking: md5crypt, crypt(3) $1$ [MD5 128/128 XOP 4x2]... (8xOMP) DONE
> Raw:    231424 c/s real, 28928 c/s virtual

> I think further speedup is possible by using a switch statement to make
> the shift counts into constants (we have an if anyway, we'll just
> replace it with a switch) like cryptmd5_kernel.cl has.

I cleaned up the code and implemented the switch - patch attached.
It turned out to cause a minor performance regression on bull (maybe due
to code size growth?), so I am disabling it for XOP, which keeps the
performance almost the same as above:

Benchmarking: md5crypt, crypt(3) $1$ [MD5 128/128 XOP 4x2]... (8xOMP) DONE
Raw:    231680 c/s real, 28923 c/s virtual

But it helps a lot on well and super.  On well, with the changes from
earlier today but without the switch yet:

Benchmarking: md5crypt, crypt(3) $1$ [MD5 256/256 AVX2 8x3]... (8xOMP) DONE
Raw:    397824 c/s real, 49790 c/s virtual

with switch:

Benchmarking: md5crypt, crypt(3) $1$ [MD5 256/256 AVX2 8x3]... (8xOMP) DONE
Raw:    425472 c/s real, 53184 c/s virtual

super, default gcc (old), version from a few days ago:

Benchmarking: md5crypt, crypt(3) $1$ [MD5 128/128 AVX 4x3]... (32xOMP) DONE
Raw:    605184 c/s real, 18912 c/s virtual

with my changes from earlier today:

Benchmarking: md5crypt, crypt(3) $1$ [MD5 128/128 AVX 4x3]... (32xOMP) DONE
Raw:    619008 c/s real, 19307 c/s virtual

with switch:

Benchmarking: md5crypt, crypt(3) $1$ [MD5 128/128 AVX 4x3]... (32xOMP) DONE
Raw:    638976 c/s real, 19943 c/s virtual

super's latest gcc (4.9.1 after "scl enable devtoolset-3 bash") with the
latest code (with switch):

Benchmarking: md5crypt, crypt(3) $1$ [MD5 128/128 AVX 4x3]... (32xOMP) DONE
Raw:    731136 c/s real, 22798 c/s virtual

IIRC, previously it was below 700k.

switch can probably be made beneficial for XOP as well if we reduce code
size elsewhere, but I had no luck with that so far (e.g., simply not
inlining the function causes a bigger performance regression).
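
For readers without the attachment, here is a rough sketch of the general
idea of using a switch to turn a run-time shift count into constants.  It
is not the code from the attached patch: the function names align_var()
and align_switch() are hypothetical, and it uses plain 32-bit scalars
instead of SIMD vectors to stay short.  Both functions extract 32 bits
from the 64-bit pair hi:lo starting at a byte offset of 0 to 3.

```c
#include <stdint.h>

/*
 * Hypothetical illustration only, not taken from the patch.
 * Shift count computed at run time: the compiler has to emit
 * variable-count shifts, plus a branch to avoid shifting by 32.
 */
static inline uint32_t align_var(uint32_t hi, uint32_t lo, unsigned int off)
{
	unsigned int s = (off & 3) * 8;

	if (s == 0)
		return lo;
	return (lo >> s) | (hi << (32 - s));
}

/*
 * Same result, but each case uses constant shift counts, so every
 * shift can be encoded with an immediate operand.  The price is that
 * the body is replicated once per case, which grows the code.
 */
static inline uint32_t align_switch(uint32_t hi, uint32_t lo, unsigned int off)
{
	switch (off & 3) {
	case 0:
		return lo;
	case 1:
		return (lo >> 8) | (hi << 24);
	case 2:
		return (lo >> 16) | (hi << 16);
	default:
		return (lo >> 24) | (hi << 8);
	}
}
```

The two functions return the same values for every offset; the only
difference is whether the shift counts are visible to the compiler as
constants.  The replicated case bodies are the kind of code size growth
suspected above for the XOP regression.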

Alexander

View attachment "john-md5crypt-bitalign2.diff" of type "text/plain" (4541 bytes)
