Date: Wed, 11 Nov 2020 20:58:05 +0100
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Subject: Re: SIMD performance impact

On Thu, Oct 15, 2020 at 07:08:32PM +0200, Solar Designer wrote:
> I'm sorry I didn't bother running deliberately suboptimal builds on
> recent CPUs yet, but I've just added benchmarks of AWS EC2 c5.24xlarge
> (2x Intel Xeon Platinum 8275CL, 3.6 GHz all-core turbo) and AWS EC2
> c5a.24xlarge (AMD EPYC 7R32, ~3.3 GHz sustained turbo) as text files
> linked from these AWS EC2 instance names at:
>
> https://www.openwall.com/john/cloud/
>
> The Intel benchmark uses AVX-512, the AMD one uses AVX2, except where
> the corresponding JtR format doesn't support SIMD (e.g., bcrypt) or
> doesn't support wide SIMD (e.g., scrypt uses plain AVX).
>
> AVX-512 wins by a large margin, but on the other hand it's two Intel
> chips for the 96 vCPUs vs. just one AMD chip for the same vCPU count.
> Much higher TDP for the two chips, too.
>
> Some highlights, Intel:
>
> Benchmarking: descrypt, traditional crypt(3) [DES 512/512 AVX512F]... (96xOMP) DONE
> Many salts:     561512K c/s real, 5906K c/s virtual
> Only one salt:  85685K c/s real, 1415K c/s virtual

> It's curious that AVX-512 speeds up algorithms based on SHA-2 by a
> factor of 3. I guess that's due to the bit rotate and "ternary logic"
> instructions (3-input LUTs). It's also curious the same isn't of as
> much help for the faster hashes, especially not for descrypt (even
> though the "ternary logic" instructions are also in use there), maybe
> because we're exceeding L1 data cache with all the in-flight hashes.

Turns out it was primarily Amdahl's law - for descrypt, the comparisons
against loaded hashes were performed by a single thread (but in a smart
manner, not directly comparing hashes one to one). I added a minimal
fix for this on Nov 1 in 645a378764d1, which resulted in:

Benchmarking: descrypt, traditional crypt(3) [DES 512/512 AVX512F]...
(96xOMP) DONE
Many salts:     831258K c/s real, 8784K c/s virtual
Only one salt:  88033K c/s real, 1417K c/s virtual

Today, I merged a further fix ec6d12bab175, which brought the speeds to:

Benchmarking: descrypt, traditional crypt(3) [DES 512/512 AVX512F]... (96xOMP) DONE
Many salts:     861929K c/s real, 9087K c/s virtual
Only one salt:  90046K c/s real, 1416K c/s virtual

Single core speed with a non-OpenMP build:

Benchmarking: descrypt, traditional crypt(3) [DES 512/512 AVX512F]... DONE
Many salts:     27218K c/s real, 27218K c/s virtual
Only one salt:  21903K c/s real, 21903K c/s virtual

The candidate passwords stream is still produced by a single thread,
which is why the "Only one salt" speed is much lower. So far, we only
address that limitation for fast hashes, with mask mode on OpenCL.
However, for "Many salts" I think this is a very impressive speed to
have on CPU.

Actual cracking at ~1.4 hashes/salt achieves ~840M. With "--fork=2" (2
processes with 48 threads each), it's ~870M combined. With "--fork=6",
it's ~900M. With "--fork=48" and a non-OpenMP build, it's ~21.5M*48 =
1032M. These are with incremental mode locked to length 7 and few
successful cracks (speeds are somewhat lower when frequently detecting
successful cracks). Going for mask mode also at length 7 improves the
last of these to ~22M*48 = 1056M.

Alexander
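To make the Amdahl's law effect concrete, here is a rough Python sketch (not JtR code): it inverts the speedup formula to estimate the serial fraction implied by the benchmark figures quoted above. Note the single-core figure comes from the fixed build, so the "before fix" estimate is only approximate.

```python
# Amdahl's law: speedup(n) = 1 / ((1 - p) + p / n), where p is the
# parallel fraction of the work and n is the thread count.

def amdahl_speedup(p, n):
    """Speedup on n threads when a fraction p of the work is parallel."""
    return 1.0 / ((1.0 - p) + p / n)

def serial_fraction(speedup, n):
    """Invert Amdahl's law: serial fraction implied by observed speedup."""
    return (n / speedup - 1.0) / (n - 1.0)

# Figures from the post, in K c/s ("Many salts", Intel c5.24xlarge):
single = 27218    # one core, non-OpenMP build (after the fixes)
before = 561512   # 96 threads, before the Nov 1 fix
after = 861929    # 96 threads, after both fixes

for label, rate in (("before fix", before), ("after fixes", after)):
    s = rate / single
    print(f"{label}: {s:.1f}x over one core, "
          f"implied serial fraction ~{serial_fraction(s, 96):.1%}")
```

Even a few percent of serial work (here, the single-threaded hash comparisons) caps the achievable speedup well below 96x, which is why parallelizing that step helped so much.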
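As a generic illustration of comparing computed hashes against loaded ones "not directly one to one" (JtR's actual descrypt code is bitsliced and substantially different; this is only a sketch of the general idea), one can bucket the loaded hashes by a few low bits so that each computed hash costs a single bucket probe instead of a scan over all loaded hashes:

```python
# Sketch: avoid O(computed * loaded) one-to-one comparisons by indexing
# the loaded hashes on their low bits. Table size is illustrative.

BUCKET_BITS = 12
MASK = (1 << BUCKET_BITS) - 1

def build_table(loaded_hashes):
    """Bucket loaded hashes by their low BUCKET_BITS bits."""
    table = {}
    for h in loaded_hashes:
        table.setdefault(h & MASK, []).append(h)
    return table

def check(table, computed):
    """Return those computed hashes that match some loaded hash."""
    hits = []
    for h in computed:
        # Only the (usually empty or tiny) matching bucket is examined.
        for cand in table.get(h & MASK, ()):
            if cand == h:
                hits.append(h)
    return hits
```

With realistic hash counts, almost every probe hits an empty bucket, so the comparison step becomes cheap enough that distributing it across threads (as the Nov 1 fix did) removes it as a serial bottleneck.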