john-dev - Re: [GSoC] JtR SIMD support enhancements

Message-ID: <20150425123413.GA20496@openwall.com>
Date: Sat, 25 Apr 2015 15:34:13 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: [GSoC] JtR SIMD support enhancements

Lei,

On Thu, Apr 23, 2015 at 11:35:44PM +0800, Lei Zhang wrote:
> Please see the attachment for a full report.

Thanks!  This shows a mix of decent speeds (descrypt, md5crypt) and
worse speeds and even ridiculously low speeds.  None of these speeds
are particularly high, though.  Modern GPUs generally do better (e.g.,
the md5crypt speed is on par with that of GPUs from a few years back).
But we don't have the full set of formats supported on GPUs yet, so this
may be useful - e.g., for things such as SunMD5.

The numerically higher speeds - those in the millions of c/s - are the
ridiculously low ones, because those are fast hashes that are meant to
reach a billion c/s with proper code.  But that was to be expected,
because of how we've implemented OpenMP so far.  We have the same
problem on CPU; it's just even more pronounced on MIC, where the
architectural differences amplify the effect of Amdahl's law.
You may experiment with --fork=240 and length locked e.g. to 8 chars.
The cumulative speeds for the fast hashes should be a lot higher then,
possibly reaching a billion c/s for e.g. NTLM and raw MD4.
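
To make the Amdahl's law point concrete, here is a minimal sketch in
Python.  The parallel fractions below are hypothetical values chosen only
for illustration; they are not measurements of JtR's OpenMP code.  The
sketch computes the upper bound on speedup for 240 threads when some
part of the work stays serial.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Upper bound on speedup per Amdahl's law.

    parallel_fraction: share of the work that can be spread across workers
                       (hypothetical input, not a measured JtR value).
    workers:           number of threads or processes.
    """
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


WORKERS = 240  # thread count used in the quoted benchmarks

# Illustrative parallel fractions only.
for parallel_fraction in (0.90, 0.99, 0.999):
    bound = amdahl_speedup(parallel_fraction, WORKERS)
    print(f"parallel fraction {parallel_fraction:.3f}: "
          f"at most {bound:.1f}x on {WORKERS} threads")
```

Even a small serial share caps the achievable speedup well below the
thread count, which is why independent processes (--fork) may scale
better for fast hashes than threads that share one serial portion.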

> I did tune a bunch of OMP_SCALEs. Some of them are too big by default and would drain MIC's memory if not tuned. There're just too many formats there to do a thorough check. So I just picked out some formats that have too big an OMP_SCALE (e.g. > 4096), and experimentally tuned them one by one.

You need to also tune them for best performance.  In fact, tuning of the
OMP_SCALE factors and the interleave factors should be done together.

> Benchmarking: NT [MD4 32/64]... DONE
> Raw:	3509K c/s real, 3509K c/s virtual

No MIC code for it yet?

> Benchmarking: sha256crypt, crypt(3) $5$ (rounds=5000) [SHA256 512/512 MIC 16x]... (240xOMP) DONE
> Speed for cost 1 (iteration count) of 5000
> Raw:	23141 c/s real, 111 c/s virtual
> 
> Benchmarking: sha512crypt, crypt(3) $6$ (rounds=5000) [SHA512 512/512 MIC 8x]... (240xOMP) DONE
> Speed for cost 1 (iteration count) of 5000
> Raw:	6168 c/s real, 25.7 c/s virtual

> Benchmarking: Drupal7, $S$ (x16385) [SHA512 512/512 MIC 8x]... (240xOMP) DONE
> Speed for cost 1 (iteration count) of 16384
> Raw:	2039 c/s real, 8.5 c/s virtual

These are not ridiculous, but they are not great either.  There ought to
be room for improvement here, such as through interleaving.  And these
are formats that are actually relevant to run on MIC.

> Benchmarking: nt2, NT [MD4 512/512 MIC 16x]... DONE
> Raw:	4907K c/s real, 4907K c/s virtual

Very little improvement relative to the "NT" format.  I expected more of
a difference.  Perhaps this will be seen with --fork=240.  I guess the
SIMD instructions have higher latency, which hurts more when running
only one thread/core.  We need to run 4 threads/core here.

> Benchmarking: phpass ($P$9) [phpass ($P$ or $H$) 128/128 MIC 16x1]... (240xOMP) DONE
> Raw:	17976 c/s real, 75.5 c/s virtual

This is very poor speed.  Needs to be investigated.
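
One quick sanity check on the 240xOMP results quoted above is the ratio
of real to virtual c/s.  This sketch assumes that the "virtual" figure is
based on CPU time summed across all OpenMP threads, so that the ratio
roughly approximates how many threads were kept busy.  The printed
figures are rounded, so the ratios are approximate too.

```python
# (real c/s, virtual c/s) copied from the quoted benchmark output.
benchmarks = {
    "sha256crypt": (23141, 111),
    "sha512crypt": (6168, 25.7),
    "Drupal7": (2039, 8.5),
    "phpass": (17976, 75.5),
}

THREADS = 240  # from the "(240xOMP)" tag in the output


def real_to_virtual_ratio(real: float, virtual: float) -> float:
    """Ratio of real to virtual c/s.

    Under the assumption stated above, this is a rough estimate of the
    number of threads that were busy during the benchmark.
    """
    return real / virtual


ratios = {
    name: real_to_virtual_ratio(real, virtual)
    for name, (real, virtual) in benchmarks.items()
}

for name, ratio in ratios.items():
    print(f"{name}: real/virtual = {ratio:.1f} (of {THREADS} threads)")
```

If the assumption holds, a ratio close to 240 would suggest that the
threads are busy rather than idle, in which case the per-thread code is
the more likely place to look for the lost speed.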

> Benchmarking: SunMD5 [MD5 512/512 MIC 16x]... DONE
> Speed for cost 1 (iteration count) of 5000
> Raw:	75.0 c/s real, 75.0 c/s virtual

No OpenMP for it yet - we should add that.  Want to work on this?  Not
only for MIC, but in general.

Meanwhile, it'd be interesting to see how it performs with --fork=240.
There's no GPU alternative to this yet, as far as I'm aware, so it's
relevant.

> All 298 formats passed self-tests!

Cool.

Alexander