Date: Wed, 11 Nov 2020 21:10:59 +0100
From: Solar Designer <>
Subject: Re: SIMD performance impact

On Wed, Nov 11, 2020 at 08:58:05PM +0100, Solar Designer wrote:
> On Thu, Oct 15, 2020 at 07:08:32PM +0200, Solar Designer wrote:
> > I've just added benchmarks of AWS EC2 c5.24xlarge
> > (2x Intel Xeon Platinum 8275CL, 3.6 GHz all-core turbo) and AWS EC2
> > c5a.24xlarge (AMD EPYC 7R32, ~3.3 GHz sustained turbo) as text files
> > linked from these AWS EC2 instance names at:
> > 
> >

> > Some highlights, Intel:
> > 
> > Benchmarking: descrypt, traditional crypt(3) [DES 512/512 AVX512F]... (96xOMP) DONE
> > Many salts:     561512K c/s real, 5906K c/s virtual
> > Only one salt:  85685K c/s real, 1415K c/s virtual
> > It's curious that AVX-512 speeds up algorithms based on SHA-2 by a
> > factor of 3.  I guess that's due to the bit rotate and "ternary logic"
> > instructions (3-input LUTs).  It's also curious the same isn't of as
> > much help for the faster hashes, especially not for descrypt (even
> > though the "ternary logic" instructions are also in use there), maybe
> > because we're exceeding L1 data cache with all the in-flight hashes.
> Turns out it was primarily Amdahl's law - for descrypt, the comparisons
> against loaded hashes were performed by a single thread (but in a smart
> manner, not directly comparing hashes one to one).  I added a minimal
> fix for this on Nov 1 in 645a378764d1, which resulted in:
> Benchmarking: descrypt, traditional crypt(3) [DES 512/512 AVX512F]... (96xOMP) DONE
> Many salts:	831258K c/s real, 8784K c/s virtual
> Only one salt:	88033K c/s real, 1417K c/s virtual
> Today, I merged a further fix ec6d12bab175, which brought the speeds to:
> Benchmarking: descrypt, traditional crypt(3) [DES 512/512 AVX512F]... (96xOMP) DONE
> Many salts:	861929K c/s real, 9087K c/s virtual
> Only one salt:	90046K c/s real, 1416K c/s virtual
> Single core speed with a non-OpenMP build:
> Benchmarking: descrypt, traditional crypt(3) [DES 512/512 AVX512F]... DONE
> Many salts:	27218K c/s real, 27218K c/s virtual
> Only one salt:	21903K c/s real, 21903K c/s virtual
> The candidate password stream is still produced by a single thread,
> which is why the "Only one salt" speed is much lower.  So far, we only
> address that limitation for fast hashes with mask mode in OpenCL.  However,
> for "Many salts" I think this is a very impressive speed to have on CPU.
> Actual cracking at ~1.4 hashes/salt achieves ~840M.  With "--fork=2" (2
> processes with 48 threads each), it's ~870M combined.  With "--fork=6",
> it's ~900M.  With "--fork=48" and a non-OpenMP build, it's ~21.5M*48 =
> 1032M.  These are with incremental mode locked to length 7 and few
> successful cracks (speeds are somewhat lower when frequently detecting
> successful cracks).  Going for mask mode also at length 7 improves the
> last one of these to ~22M*48 = 1056M.

Oh, I realized I need to clarify: despite the ~1.4 hashes/salt, the
speeds I posted are c/s (hashes computed per second), not C/s
(effective combinations tested per second) - the latter were accordingly
~1.4x higher than those I posted.  The ~1.4 is simply to have a
realistic test case, as descrypt salt collisions are very common.  The
c/s figures would actually be very slightly higher with only one hash
per salt (due to fewer comparisons to perform), but then it'd be weird
to have many salts without collisions.
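
To make the c/s vs. C/s relationship concrete, here is a rough sketch of
the arithmetic in Python.  The function name and labels are made up for
illustration, and 1.4 is only the approximate hashes/salt ratio of this
test, so the resulting C/s figures are estimates rather than measured
values:

```python
def effective_rate(c_per_s, hashes_per_salt):
    """Estimate C/s (candidate/hash combinations tested per second)
    from c/s (hashes computed per second).

    Each computed hash for a given salt is compared against every loaded
    hash sharing that salt, so on average it tests hashes_per_salt
    combinations.
    """
    return c_per_s * hashes_per_salt


HASHES_PER_SALT = 1.4  # approximate ratio in the test described above

# c/s figures quoted above (approximate, descrypt on c5.24xlarge)
reported_c_s = {
    "OpenMP, 96 threads": 840e6,
    "--fork=2, 48 threads each": 870e6,
    "--fork=6": 900e6,
    "--fork=48, non-OpenMP, incremental": 21.5e6 * 48,
    "--fork=48, non-OpenMP, mask": 22e6 * 48,
}

for label, c_s in reported_c_s.items():
    big_c_s = effective_rate(c_s, HASHES_PER_SALT)
    print(f"{label}: ~{c_s / 1e6:.0f}M c/s, ~{big_c_s / 1e6:.0f}M C/s")
```

For example, the ~840M c/s OpenMP figure corresponds to roughly 1.4 *
840M = ~1176M C/s under this assumption.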


