Date: Sat, 25 Apr 2015 23:10:05 +0800
From: Lei Zhang <>
Subject: Re: [GSoC] JtR SIMD support enhancements

> On Apr 25, 2015, at 8:34 PM, Solar Designer <> wrote:
> On Thu, Apr 23, 2015 at 11:35:44PM +0800, Lei Zhang wrote:
>> Please see the attachment for a full report.
> Thanks!  This shows a mix of decent speeds (descrypt, md5crypt) and
> worse speeds and even ridiculously low speeds.  None of these speeds
> are particularly high, though.  Modern GPUs generally do better (e.g.,
> the md5crypt speed is on par with that of GPUs from a few years back).
> But we don't have the full set of formats supported on GPUs yet, so this
> may be useful - e.g., for things such as SunMD5.
> The numerically higher speeds - those in the millions of c/s - are the
> ridiculously low ones, because those are fast hashes that are meant to
> reach a billion c/s with proper code.  But that was to be expected,
> because of how we've implemented OpenMP so far.  We have the same
> problem on CPU; it's just even more profound on MIC due to the
> differences in architecture emphasizing the effect of Amdahl's law.
> You may experiment with --fork=240 and length locked e.g. to 8 chars.
> The cumulative speeds for the fast hashes should be a lot higher then,
> possibly reaching a billion c/s for e.g. NTLM and raw MD4.
>> I did tune a bunch of OMP_SCALEs. Some of them are too big by default and would exhaust MIC's memory if not tuned. There are just too many formats to do a thorough check, so I just picked out some formats that have too big an OMP_SCALE (e.g. > 4096) and experimentally tuned them one by one.
> You need to also tune them for best performance.  In fact, tuning of the
> OMP_SCALE factors and the interleave factors should be done together.
>> Benchmarking: NT [MD4 32/64]... DONE
>> Raw:	3509K c/s real, 3509K c/s virtual
> No MIC code for it yet?
>> Benchmarking: sha256crypt, crypt(3) $5$ (rounds=5000) [SHA256 512/512 MIC 16x]... (240xOMP) DONE
>> Speed for cost 1 (iteration count) of 5000
>> Raw:	23141 c/s real, 111 c/s virtual
>> Benchmarking: sha512crypt, crypt(3) $6$ (rounds=5000) [SHA512 512/512 MIC 8x]... (240xOMP) DONE
>> Speed for cost 1 (iteration count) of 5000
>> Raw:	6168 c/s real, 25.7 c/s virtual
>> Benchmarking: Drupal7, $S$ (x16385) [SHA512 512/512 MIC 8x]... (240xOMP) DONE
>> Speed for cost 1 (iteration count) of 16384
>> Raw:	2039 c/s real, 8.5 c/s virtual
> These are not ridiculous, but they are not great either.  There ought to
> be room for improvement here, such as through interleaving.  And these
> are actually relevant to be run on MIC.
>> Benchmarking: nt2, NT [MD4 512/512 MIC 16x]... DONE
>> Raw:	4907K c/s real, 4907K c/s virtual
> Very little improvement relative to the "NT" format.  I expected more of
> a difference.  Perhaps this will be seen with --fork=240.  I guess the
> SIMD instructions have higher latency, so impact the case of running
> only one thread/core more.  Need to run 4 threads/core here.
>> Benchmarking: phpass ($P$9) [phpass ($P$ or $H$) 128/128 MIC 16x1]... (240xOMP) DONE
>> Raw:	17976 c/s real, 75.5 c/s virtual
> This is very poor speed.  Needs to be investigated.

It'll take me some time to walk through all of the above. I'll report back later.
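
For reference, the Amdahl's law effect mentioned above can be put into numbers with a minimal sketch like the one below. It is not JtR code: the function name amdahl_speedup() and its parameters are hypothetical, and any serial fraction plugged into it is an assumed figure, not something measured on the MIC.

```c
#include <assert.h>

/*
 * Amdahl's law: ideal speedup on n workers when a fraction p (0..1) of
 * the work can run in parallel and the remaining (1 - p) stays serial.
 * Hypothetical helper for illustration only, not part of JtR.
 */
double amdahl_speedup(double p, unsigned int n)
{
	assert(p >= 0.0 && p <= 1.0);
	assert(n > 0);
	return 1.0 / ((1.0 - p) + p / n);
}
```

With an assumed serial share of just 1%, amdahl_speedup(0.99, 240) comes out at roughly 71x rather than 240x, which may be part of why the fast hashes scale so poorly under OpenMP here.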

>> Benchmarking: SunMD5 [MD5 512/512 MIC 16x]... DONE
>> Speed for cost 1 (iteration count) of 5000
>> Raw:	75.0 c/s real, 75.0 c/s virtual
> No OpenMP for it yet - we should add that.  Want to work on this?  Not
> only for MIC, but in general.

Sure, I can work on this.
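
As a starting point, the general shape of an OpenMP-parallel per-candidate loop might look like the sketch below. This is only an assumption about the approach, not the actual SunMD5 format code: crypt_all_sketch(), toy_hash() and the flat array layout are hypothetical stand-ins, and a real format would also need its per-thread working state handled.

```c
/*
 * Hypothetical stand-in for one candidate's iterated hash computation.
 * It just mixes the key 5000 times; it is not MD5 and not SunMD5.
 */
static unsigned int toy_hash(unsigned int key)
{
	unsigned int x = key;
	int i;

	for (i = 0; i < 5000; i++)
		x = x * 1664525u + 1013904223u;
	return x;
}

/*
 * Compute all candidates in one call. Each iteration touches only its
 * own keys[index] and out[index], so the loop can be split across
 * threads. Without OpenMP enabled, the pragma is skipped and the loop
 * runs serially with the same results.
 */
void crypt_all_sketch(const unsigned int *keys, unsigned int *out, int count)
{
	int index;

#ifdef _OPENMP
#pragma omp parallel for
#endif
	for (index = 0; index < count; index++)
		out[index] = toy_hash(keys[index]);
}
```

The point of the sketch is that the iterations are independent, so the number of candidates handed to each call (what OMP_SCALE multiplies) decides how much work every thread gets between synchronization points.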


