Date: Wed, 11 Nov 2020 20:58:05 +0100
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Subject: Re: SIMD performance impact

On Thu, Oct 15, 2020 at 07:08:32PM +0200, Solar Designer wrote:
> I'm sorry I didn't bother running deliberately suboptimal builds on
> recent CPUs yet, but I've just added benchmarks of AWS EC2 c5.24xlarge
> (2x Intel Xeon Platinum 8275CL, 3.6 GHz all-core turbo) and AWS EC2
> c5a.24xlarge (AMD EPYC 7R32, ~3.3 GHz sustained turbo) as text files
> linked from these AWS EC2 instance names at:
> 
> https://www.openwall.com/john/cloud/
> 
> The Intel benchmark uses AVX-512, the AMD one uses AVX2, except where
> the corresponding JtR format doesn't support SIMD (e.g., bcrypt) or
> doesn't support wide SIMD (e.g., scrypt uses plain AVX).
> 
> AVX-512 wins by a large margin, but on the other hand it's two Intel
> chips for the 96 vCPUs vs. just one AMD chip for the same vCPU count.
> Much higher TDP for the two chips, too.
> 
> Some highlights, Intel:
> 
> Benchmarking: descrypt, traditional crypt(3) [DES 512/512 AVX512F]... (96xOMP) DONE
> Many salts:     561512K c/s real, 5906K c/s virtual
> Only one salt:  85685K c/s real, 1415K c/s virtual

> It's curious that AVX-512 speeds up algorithms based on SHA-2 by a
> factor of 3.  I guess that's due to the bit rotate and "ternary logic"
> instructions (3-input LUTs).  It's also curious that the same doesn't
> help as much for the faster hashes, especially not for descrypt (even
> though the "ternary logic" instructions are in use there, too), maybe
> because we're exceeding the L1 data cache with all the in-flight hashes.
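
As an aside, here's roughly what those "ternary logic" instructions buy
us for SHA-2 - a minimal sketch using compiler intrinsics, my
illustration rather than JtR's actual code.  Each vpternlog evaluates an
arbitrary 3-input boolean function in one instruction, replacing the
usual 3+ bitwise instructions for Ch and Maj:

#include <immintrin.h>

/* Ch(e,f,g) = (e & f) ^ (~e & g) - one instruction, truth table 0xCA */
static inline __m512i sha2_ch(__m512i e, __m512i f, __m512i g)
{
	return _mm512_ternarylogic_epi32(e, f, g, 0xCA);
}

/* Maj(a,b,c) = (a & b) ^ (a & c) ^ (b & c) - truth table 0xE8 */
static inline __m512i sha2_maj(__m512i a, __m512i b, __m512i c)
{
	return _mm512_ternarylogic_epi32(a, b, c, 0xE8);
}

/* Sigma0(a) = ROTR(a,2) ^ ROTR(a,13) ^ ROTR(a,22) - AVX-512's native
   rotates avoid the usual shift+shift+or emulation, and the final
   3-way XOR is another vpternlog, truth table 0x96 */
static inline __m512i sha256_Sigma0(__m512i a)
{
	return _mm512_ternarylogic_epi32(_mm512_ror_epi32(a, 2),
	    _mm512_ror_epi32(a, 13), _mm512_ror_epi32(a, 22), 0x96);
}

Combined with the doubled vector width over AVX2, that's plausibly
where the 3x on SHA-2 comes from.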

It turns out it was primarily Amdahl's law: for descrypt, the
comparisons against loaded hashes were performed by a single thread
(albeit in a smart manner, not comparing hashes one to one directly).
I added a minimal fix for this on Nov 1 in 645a378764d1, which
resulted in:

Benchmarking: descrypt, traditional crypt(3) [DES 512/512 AVX512F]... (96xOMP) DONE
Many salts:	831258K c/s real, 8784K c/s virtual
Only one salt:	88033K c/s real, 1417K c/s virtual
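
The general idea of the fix (a sketch of my own here, not the actual
commit - the names below are made up for illustration) is to split the
batch of computed hashes among the existing OpenMP threads, each
probing the shared lookup structure of loaded hashes, instead of
leaving that entire loop to one thread:

#include <omp.h>

/* Hypothetical names: "computed" holds one hash word per candidate in
   the batch, bitmap_probe() checks it against the loaded hashes (via a
   bitmap and/or hash table, not one-to-one comparisons) */
extern unsigned int computed[];
extern int bitmap_probe(unsigned int hash);

static int any_match(int count)
{
	int found = 0;
#pragma omp parallel for reduction(|:found)
	for (int i = 0; i < count; i++)
		found |= bitmap_probe(computed[i]);
	return found;
}

With many salts and candidate rates this high, even that relatively
cheap step becomes the serial bottleneck at 96 threads.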

Today, I merged a further fix ec6d12bab175, which brought the speeds to:

Benchmarking: descrypt, traditional crypt(3) [DES 512/512 AVX512F]... (96xOMP) DONE
Many salts:	861929K c/s real, 9087K c/s virtual
Only one salt:	90046K c/s real, 1416K c/s virtual

Single core speed with a non-OpenMP build:

Benchmarking: descrypt, traditional crypt(3) [DES 512/512 AVX512F]... DONE
Many salts:	27218K c/s real, 27218K c/s virtual
Only one salt:	21903K c/s real, 21903K c/s virtual

The candidate password stream is still produced by a single thread,
which is why the "Only one salt" speed is so much lower.  So far, we
only address that limitation for fast hashes with mask mode in OpenCL.
For "Many salts", however, I think this is a very impressive speed to
have on a CPU.

Actual cracking at ~1.4 hashes per salt achieves ~840M c/s.  With
"--fork=2" (2 processes with 48 threads each), it's ~870M combined.
With "--fork=6", it's ~900M.  With "--fork=48" and a non-OpenMP build,
it's ~21.5M*48 = 1032M.  These figures are with incremental mode locked
to length 7 and few successful cracks (speeds are somewhat lower when
cracks are detected frequently).  Switching to mask mode, also at
length 7, improves the last of these to ~22M*48 = 1056M.
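
For reference, the invocations were along these lines (reconstructed
for illustration, not copy-pasted; "hashes" stands for the actual
password hash file):

./john --format=descrypt --incremental --min-length=7 --max-length=7 \
    --fork=48 hashes
./john --format=descrypt --mask='?a?a?a?a?a?a?a' --fork=48 hashes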

Alexander
