john-users - Re: SIMD performance impact

Message-ID: <20201112205125.GA25194@openwall.com>
Date: Thu, 12 Nov 2020 21:51:25 +0100
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Subject: Re: SIMD performance impact

On Thu, Oct 15, 2020 at 07:08:32PM +0200, Solar Designer wrote:
> I've just added benchmarks of AWS EC2 c5.24xlarge
> (2x Intel Xeon Platinum 8275CL, 3.6 GHz all-core turbo) and AWS EC2

I think the clock rate is actually way lower when running AVX-512 code.
The 3.6 GHz all-core turbo is probably for at most 128-bit SIMD.

> c5a.24xlarge (AMD EPYC 7R32, ~3.3 GHz sustained turbo) as text files
> linked from these AWS EC2 instance names at:
> 
> https://www.openwall.com/john/cloud/
> 
> The Intel benchmark uses AVX-512, the AMD one uses AVX2, except where
> the corresponding JtR format doesn't support SIMD (e.g., bcrypt) or
> doesn't support wide SIMD (e.g., scrypt uses plain AVX).
> 
> AVX-512 wins by a large margin, but on the other hand it's two Intel
> chips for the 96 vCPUs vs. just one AMD chip for the same vCPU count.
> Much higher TDP for the two chips, too.
> 
> Some highlights, Intel:
> 
> Benchmarking: descrypt, traditional crypt(3) [DES 512/512 AVX512F]... (96xOMP) DONE
> Many salts:     561512K c/s real, 5906K c/s virtual
> Only one salt:  85685K c/s real, 1415K c/s virtual

> AMD:
> 
> Benchmarking: descrypt, traditional crypt(3) [DES 256/256 AVX2]... (96xOMP) DONE
> Many salts:     408354K c/s real, 4262K c/s virtual
> Only one salt:  64290K c/s real, 668373 c/s virtual

With today's further changes in PR #4453 and further experiments, these
are now improved to:

Intel:

$ GOMP_CPU_AFFINITY=0-47 OMP_NUM_THREADS=48 ./john -test -form=descrypt
Will run 48 OpenMP threads
Benchmarking: descrypt, traditional crypt(3) [DES 512/512 AVX512F]... (48xOMP) DONE
Many salts:     941985K c/s real, 19649K c/s virtual
Only one salt:  102051K c/s real, 2126K c/s virtual

The affinity was only needed because our default benchmark is quick.
During actual cracking, the scheduler eventually does the right thing on
its own, without an explicit affinity setting.
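
For anyone scripting the pinned run above, here is a minimal sketch of setting up the same environment from Python. GOMP_CPU_AFFINITY and OMP_NUM_THREADS are the variables from the command line above (the former is specific to GCC's libgomp). The helper names pinned_env and run_pinned_benchmark are hypothetical, and whether CPUs 0-47 are the right ones to pin to on another machine is an assumption to check against that machine's topology.

```python
import os
import subprocess


def pinned_env(first_cpu, last_cpu, base=None):
    """Return a copy of the environment asking for one OpenMP thread per listed CPU."""
    env = dict(os.environ if base is None else base)
    env["GOMP_CPU_AFFINITY"] = f"{first_cpu}-{last_cpu}"  # honored by libgomp only
    env["OMP_NUM_THREADS"] = str(last_cpu - first_cpu + 1)
    return env


# Same command line as in the message; ./john is assumed to be in the
# current directory.
BENCH_CMD = ["./john", "-test", "-form=descrypt"]


def run_pinned_benchmark(first_cpu=0, last_cpu=47):
    """Run the quick benchmark with threads pinned to CPUs first_cpu..last_cpu."""
    return subprocess.run(BENCH_CMD, env=pinned_env(first_cpu, last_cpu), check=True)
```

Nothing is executed until run_pinned_benchmark() is called, so the environment can be inspected first.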

$ ./john -test -form=descrypt
Will run 96 OpenMP threads
Benchmarking: descrypt, traditional crypt(3) [DES 512/512 AVX512F]... (96xOMP) DONE
Many salts:     862912K c/s real, 9047K c/s virtual
Only one salt:  84194K c/s real, 1048K c/s virtual

AMD:

Benchmarking: descrypt, traditional crypt(3) [DES 256/256 AVX2]... (96xOMP) DONE
Many salts:     541163K c/s real, 5640K c/s virtual
Only one salt:  65731K c/s real, 686305 c/s virtual
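
To put the before/after "Many salts" figures side by side, here is a small sketch that computes the ratios. The numbers are copied from the benchmark output in this message (in units of K c/s); the dictionary layout and the speedup helper are just illustrative.

```python
# "Many salts" descrypt rates in K c/s, copied from the benchmark lines above.
before = {"Intel, 96 threads": 561512, "AMD, 96 threads": 408354}
after = {"Intel, 96 threads": 862912, "AMD, 96 threads": 541163}
intel_48_pinned = 941985  # 48 threads with GOMP_CPU_AFFINITY=0-47


def speedup(old, new):
    """Ratio of new speed to old speed."""
    return new / old


for name in before:
    print(f"{name}: {speedup(before[name], after[name]):.2f}x")

# Pinned 48-thread Intel run relative to the original 96-thread result.
print(f"Intel, 48 pinned threads: "
      f"{speedup(before['Intel, 96 threads'], intel_48_pinned):.2f}x")
```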

Actual cracking, Intel:

$ OMP_NUM_THREADS=48 ./john ~/pw-fake-unix -form=descrypt -mask -len=7 -progress=10
Using default input encoding: UTF-8
Loaded 3269 password hashes with 2243 different salts (1.5x same-salt boost) (descrypt, traditional crypt(3) [DES 512/512 AVX512F])
Will run 48 OpenMP threads
Using default mask: ?1?2?2?2?2?2?2
Press 'q' or Ctrl-C to abort, almost any other key for status
0g 0:00:00:10 0.00% (ETA: 2020-11-16 22:59) 0g/s 378092p/s 908193Kc/s 1323MC/s Izodeaa..K0kueaa
0g 0:00:00:20 0.01% (ETA: 2020-11-16 16:23) 0g/s 405301p/s 920561Kc/s 1341MC/s Cj42iaa..Qzc5iaa
0g 0:00:00:30 0.01% (ETA: 2020-11-16 16:23) 0g/s 405368p/s 923657Kc/s 1345MC/s Idusnaa..Kb0cnaa
0g 0:00:00:40 0.01% (ETA: 2020-11-16 14:52) 0g/s 412159p/s 924954Kc/s 1348MC/s Cstqraa..Qdf2raa
0g 0:00:00:50 0.02% (ETA: 2020-11-16 15:10) 0g/s 410828p/s 925635Kc/s 1348MC/s I7xosaa..Koissaa
0g 0:00:01:00 0.02% (ETA: 2020-11-16 15:22) 0g/s 409941p/s 925972Kc/s 1349MC/s 90rptaa..M8pftaa
0g 0:00:01:10 0.02% (ETA: 2020-11-16 15:31) 0g/s 409307p/s 926259Kc/s 1349MC/s wjw5maa..Ez77maa
0g 0:00:01:20 0.02% (ETA: 2020-11-16 14:52) 0g/s 412210p/s 926397Kc/s 1350MC/s 9bkudaa..Mj3gdaa
0g 0:00:01:30 0.03% (ETA: 2020-11-16 15:02) 0g/s 411465p/s 926564Kc/s 1350MC/s ws42yaa..Edc5yaa
0g 0:00:01:40 0.03% (ETA: 2020-11-16 15:10) 0g/s 410869p/s 926655Kc/s 1350MC/s u7hsuaa..3o0cuaa
0g 0:00:01:50 0.03% (ETA: 2020-11-16 14:44) 0g/s 412839p/s 926670Kc/s 1350MC/s w8sqbaa..E7v2baa
0g 0:00:02:00 0.04% (ETA: 2020-11-16 14:52) 0g/s 412228p/s 926705Kc/s 1350MC/s uzxogaa..30esgaa
0g 0:00:02:10 0.04% (ETA: 2020-11-16 14:59) 0g/s 411710p/s 926117Kc/s 1349MC/s lbrppaa..fjpfpaa
0g 0:00:02:20 0.04% (ETA: 2020-11-16 15:05) 0g/s 411267p/s 926082Kc/s 1349MC/s Xlw5jaa..hd77jaa
0g 0:00:02:30 0.05% (ETA: 2020-11-16 14:46) 0g/s 412685p/s 926008Kc/s 1349MC/s lokufaa..fs3gfaa
0g 0:00:02:40 0.05% (ETA: 2020-11-16 14:52) 0g/s 412236p/s 925966Kc/s 1349MC/s X952waa..h7m5waa
0g 0:00:02:50 0.05% (ETA: 2020-11-16 14:57) 0g/s 411840p/s 925962Kc/s 1349MC/s Pwhsxaa..r01cxaa
0g 0:00:03:00 0.05% (ETA: 2020-11-16 15:02) 0g/s 411488p/s 925929Kc/s 1349MC/s Tu9fqaa..Zpsqqaa

I deliberately ran the default mask, which doesn't fit this set of
passwords well, so that there wouldn't be a flood of cracks in this
test (like there would be with "--incremental").
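
As a sanity check on how the p/s, c/s and C/s columns in the status lines above relate, here is a rough sketch. It assumes that c/s counts hash computations (one per candidate per salt), that C/s counts candidate/target-hash combinations, and that the K and M suffixes are powers of 1000. The reported figures are running averages, so only approximate agreement is expected.

```python
# From the "Loaded ..." line.
hashes, salts = 3269, 2243
boost = hashes / salts  # about 1.46, consistent with "1.5x same-salt boost"

# Last Intel status line (0:00:03:00).
p_s = 411488      # candidate passwords per second
c_s = 925929e3    # reported as 925929Kc/s
C_s = 1349e6      # reported as 1349MC/s

est_c_s = p_s * salts   # each candidate is hashed once per distinct salt
est_C_s = c_s * boost   # each computed hash is compared with all hashes of that salt

print(f"estimated c/s: {est_c_s / 1e6:.0f}M vs reported {c_s / 1e6:.0f}M")
print(f"estimated C/s: {est_C_s / 1e6:.0f}M vs reported {C_s / 1e6:.0f}M")
```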

$ OMP_NUM_THREADS=12 ./john ~/pw-fake-unix -form=descrypt -mask -len=7 -progress=10 -fork=4
Using default input encoding: UTF-8
Loaded 3269 password hashes with 2243 different salts (1.5x same-salt boost) (descrypt, traditional crypt(3) [DES 512/512 AVX512F])
Will run 12 OpenMP threads per process (48 total across 4 processes)
Node numbers 1-4 of 4 (fork)
Using default mask: ?1?2?2?2?2?2?2
Press 'q' or Ctrl-C to abort, almost any other key for status
3 0g 0:00:00:10 0.00% (ETA: 2020-11-16 16:27) 0g/s 101376p/s 237165Kc/s 345475KC/s Xlwyaap..egkhaap
1 0g 0:00:00:10 0.00% (ETA: 2020-11-16 16:27) 0g/s 101376p/s 234361Kc/s 341393KC/s Xlwyaaa..egkhaaa
2 0g 0:00:00:10 0.00% (ETA: 2020-11-16 16:27) 0g/s 101376p/s 238274Kc/s 347124KC/s Xlwyaam..egkhaam
4 0g 0:00:00:10 0.00% (ETA: 2020-11-16 16:27) 0g/s 101376p/s 235158Kc/s 342610KC/s Xlwyaa0..egkhaa0
3 0g 0:00:00:20 0.01% (ETA: 2020-11-16 13:28) 0g/s 104755p/s 238071Kc/s 346885KC/s axi1aap..o651aap
1 0g 0:00:00:20 0.01% (ETA: 2020-11-16 13:28) 0g/s 104755p/s 236577Kc/s 344729KC/s axi1aaa..o651aaa
2 0g 0:00:00:20 0.01% (ETA: 2020-11-16 13:28) 0g/s 104755p/s 239237Kc/s 348628KC/s axi1aam..o651aam
4 0g 0:00:00:20 0.01% (ETA: 2020-11-16 13:28) 0g/s 104755p/s 237385Kc/s 345861KC/s axi1aa0..o651aa0
3 0g 0:00:00:30 0.01% (ETA: 2020-11-16 12:31) 0g/s 105881p/s 238402Kc/s 347413KC/s irjoeap..rbhneap
1 0g 0:00:00:30 0.01% (ETA: 2020-11-16 14:26) 0g/s 103628p/s 237316Kc/s 345879KC/s X9xieaa..erjoeaa
2 0g 0:00:00:30 0.01% (ETA: 2020-11-16 12:31) 0g/s 105881p/s 239565Kc/s 349089KC/s irjoeam..rbhneam
4 0g 0:00:00:30 0.01% (ETA: 2020-11-16 12:31) 0g/s 105881p/s 238136Kc/s 347043KC/s irjoea0..rbhnea0
3 0g 0:00:00:40 0.01% (ETA: 2020-11-16 13:28) 0g/s 104755p/s 238448Kc/s 347542KC/s ayrkeap..ow7keap
1 0g 0:00:00:40 0.01% (ETA: 2020-11-16 13:28) 0g/s 104755p/s 237561Kc/s 346192KC/s ayrkeaa..ow7keaa
4 0g 0:00:00:40 0.01% (ETA: 2020-11-16 13:28) 0g/s 104755p/s 238377Kc/s 347427KC/s ayrkea0..ow7kea0
2 0g 0:00:00:40 0.01% (ETA: 2020-11-16 12:02) 0g/s 106444p/s 239603Kc/s 349169KC/s nw7keam..s53geam
3 0g 0:00:00:50 0.02% (ETA: 2020-11-16 12:53) 0g/s 105431p/s 238403Kc/s 347419KC/s i3f3eap..rok9eap
1 0g 0:00:00:50 0.02% (ETA: 2020-11-16 12:53) 0g/s 105431p/s 237649Kc/s 346311KC/s i3f3eaa..rok9eaa
2 0g 0:00:00:50 0.02% (ETA: 2020-11-16 11:46) 0g/s 106782p/s 239560Kc/s 349141KC/s lok9eam..mhc8eam
4 0g 0:00:00:50 0.02% (ETA: 2020-11-16 12:53) 0g/s 105431p/s 238464Kc/s 347512KC/s i3f3ea0..rok9ea0
william          (u245-des)
3 0g 0:00:01:00 0.02% (ETA: 2020-11-16 12:31) 0g/s 105881p/s 238262Kc/s 347217KC/s ncisiap..sv5siap
1 0g 0:00:01:00 0.02% (ETA: 2020-11-16 12:31) 0g/s 105881p/s 237611Kc/s 346302KC/s ncisiaa..sv5siaa
2 1g 0:00:01:00 0.02% (ETA: 2020-11-16 12:31) 0.01666g/s 105881p/s 239410Kc/s 348910KC/s ncisiam..sv5siam
4 0g 0:00:01:00 0.02% (ETA: 2020-11-16 12:31) 0g/s 105881p/s 238409Kc/s 347423KC/s ncisia0..sv5sia0

Slightly higher cumulative speed with some use of "--fork" (4 processes
with 12 threads each): ~953M total.
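
The ~953M figure is the sum of the four per-process c/s values at the 1:00 mark. A minimal sketch of that addition, parsing the status lines as printed above (the function name sum_node_rates is hypothetical, and the regex only handles rates printed with a "K" suffix):

```python
import re

# The four 0:00:01:00 status lines from the -fork=4 run above.
STATUS = """\
3 0g 0:00:01:00 0.02% (ETA: 2020-11-16 12:31) 0g/s 105881p/s 238262Kc/s 347217KC/s ncisiap..sv5siap
1 0g 0:00:01:00 0.02% (ETA: 2020-11-16 12:31) 0g/s 105881p/s 237611Kc/s 346302KC/s ncisiaa..sv5siaa
2 1g 0:00:01:00 0.02% (ETA: 2020-11-16 12:31) 0.01666g/s 105881p/s 239410Kc/s 348910KC/s ncisiam..sv5siam
4 0g 0:00:01:00 0.02% (ETA: 2020-11-16 12:31) 0g/s 105881p/s 238409Kc/s 347423KC/s ncisia0..sv5sia0
"""


def sum_node_rates(text):
    """Return ({node: Kc/s}, total Kc/s) for one batch of per-node status lines."""
    per_node = {}
    for line in text.splitlines():
        match = re.search(r"(\d+)Kc/s", line)  # case-sensitive, so KC/s is skipped
        if match:
            node = int(line.split()[0])
            per_node[node] = int(match.group(1))
    return per_node, sum(per_node.values())


per_node, total_kcs = sum_node_rates(STATUS)
print(f"{total_kcs} Kc/s")  # prints "953692 Kc/s"
```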

It'd still take full use of "--fork" instead of OpenMP to get beyond 1
billion like I mentioned before, but we're getting closer to that with
OpenMP now.

Actual cracking, AMD:

$ ./john ~/pw-fake-unix -form=descrypt -mask -len=7 -progress=10
Using default input encoding: UTF-8
Loaded 3269 password hashes with 2243 different salts (1.5x same-salt boost) (descrypt, traditional crypt(3) [DES 256/256 AVX2])
Will run 96 OpenMP threads
Using default mask: ?1?2?2?2?2?2?2
Press 'q' or Ctrl-C to abort, almost any other key for status
0g 0:00:00:10 0.00% (ETA: 2020-11-19 11:15) 0g/s 235929p/s 554729Kc/s 807469KC/s ceh3aaa..0cqaeaa
0g 0:00:00:20 0.00% (ETA: 2020-11-19 11:15) 0g/s 235929p/s 552576Kc/s 804401KC/s vi1weaa..Edi9eaa
0g 0:00:00:30 0.01% (ETA: 2020-11-19 11:15) 0g/s 235929p/s 552507Kc/s 804952KC/s 9ookiaa..Dybziaa
0g 0:00:00:40 0.01% (ETA: 2020-11-19 11:15) 0g/s 235929p/s 551293Kc/s 803060KC/s Lnkmoaa..Fh2koaa
0g 0:00:00:50 0.01% (ETA: 2020-11-19 11:15) 0g/s 235929p/s 551107Kc/s 803009KC/s Hr3inaa..rbrcnaa
0g 0:00:01:00 0.01% (ETA: 2020-11-19 11:15) 0g/s 235929p/s 550738Kc/s 802691KC/s Xll5naa..bkporaa

Somehow this is slightly better than the speed the benchmark reported
(~551M vs. ~541M c/s).

$ ./john ~/pw-fake-unix -form=descrypt -mask -len=7 -progress=10 -fork=4
Using default input encoding: UTF-8
Loaded 3269 password hashes with 2243 different salts (1.5x same-salt boost) (descrypt, traditional crypt(3) [DES 256/256 AVX2])
Will run 24 OpenMP threads per process (96 total across 4 processes)
Node numbers 1-4 of 4 (fork)
Using default mask: ?1?2?2?2?2?2?2
Press 'q' or Ctrl-C to abort, almost any other key for status
1 0g 0:00:00:10 0.00% (ETA: 2020-11-19 11:16) 0g/s 58923p/s 140385Kc/s 204331KC/s pmysaaa..Edlmaaa
4 0g 0:00:00:10 0.00% (ETA: 2020-11-19 11:16) 0g/s 58923p/s 142035Kc/s 206615KC/s pmysaa0..Edlmaa0
2 0g 0:00:00:10 0.00% (ETA: 2020-11-19 11:16) 0g/s 58923p/s 142565Kc/s 207307KC/s pmysaam..Edlmaam
3 0g 0:00:00:10 0.00% (ETA: 2020-11-19 11:16) 0g/s 58923p/s 142049Kc/s 206629KC/s pmysaap..Edlmaap
1 0g 0:00:00:20 0.00% (ETA: 2020-11-19 11:16) 0g/s 58952p/s 140374Kc/s 204434KC/s Apxuaaa..Jvpkaaa
2 0g 0:00:00:20 0.00% (ETA: 2020-11-19 11:16) 0g/s 58952p/s 142562Kc/s 207573KC/s Apxuaam..Jvpkaam
3 0g 0:00:00:20 0.00% (ETA: 2020-11-19 11:16) 0g/s 58952p/s 142113Kc/s 206932KC/s Apxuaap..Jvpkaap
4 0g 0:00:00:20 0.00% (ETA: 2020-11-19 11:16) 0g/s 58952p/s 142150Kc/s 206983KC/s Apxuaa0..Jvpkaa0
1 0g 0:00:00:30 0.01% (ETA: 2020-11-19 11:16) 0g/s 58962p/s 140409Kc/s 204600KC/s G0awaaa..d99zaaa
2 0g 0:00:00:30 0.01% (ETA: 2020-11-19 11:16) 0g/s 58962p/s 142498Kc/s 207804KC/s G0awaam..d99zaam
4 0g 0:00:00:30 0.01% (ETA: 2020-11-19 11:16) 0g/s 58962p/s 142124Kc/s 207219KC/s G0awaa0..d99zaa0
3 0g 0:00:00:30 0.01% (ETA: 2020-11-19 11:16) 0g/s 58962p/s 142075Kc/s 207126KC/s G0awaap..d99zaap
2 0g 0:00:00:40 0.01% (ETA: 2020-11-19 01:55) 0g/s 62653p/s 142468Kc/s 207566KC/s 9os8aam..Yre4aam
1 0g 0:00:00:40 0.01% (ETA: 2020-11-19 11:16) 0g/s 58967p/s 140370Kc/s 204591KC/s ceh3aaa..3os8aaa
4 0g 0:00:00:40 0.01% (ETA: 2020-11-19 01:55) 0g/s 62653p/s 142110Kc/s 207042KC/s 9os8aa0..Yre4aa0
3 0g 0:00:00:40 0.01% (ETA: 2020-11-19 01:55) 0g/s 62653p/s 142033Kc/s 206943KC/s 9os8aap..Yre4aap
2 0g 0:00:00:50 0.01% (ETA: 2020-11-19 03:42) 0g/s 61919p/s 142436Kc/s 207550KC/s Byjieam..rbhneam
1 0g 0:00:00:50 0.01% (ETA: 2020-11-19 03:42) 0g/s 61919p/s 140354Kc/s 204506KC/s Byjieaa..rbhneaa
4 0g 0:00:00:50 0.01% (ETA: 2020-11-19 03:42) 0g/s 61919p/s 142082Kc/s 207004KC/s Byjiea0..rbhnea0
3 0g 0:00:00:50 0.01% (ETA: 2020-11-19 03:42) 0g/s 61919p/s 142003Kc/s 206883KC/s Byjieap..rbhneap
2 0g 0:00:01:00 0.01% (ETA: 2020-11-19 04:55) 0g/s 61440p/s 142449Kc/s 207571KC/s nw8meam..zxqdeam
1 0g 0:00:01:00 0.01% (ETA: 2020-11-19 04:55) 0g/s 61429p/s 140353Kc/s 204494KC/s nw8meaa..zxqdeaa
3 0g 0:00:01:00 0.01% (ETA: 2020-11-19 04:55) 0g/s 61429p/s 142009Kc/s 206947KC/s nw8meap..zxqdeap
4 0g 0:00:01:00 0.01% (ETA: 2020-11-19 04:55) 0g/s 61429p/s 142095Kc/s 207074KC/s nw8mea0..zxqdea0

That's ~567M total.  Again, pure "--fork" in a non-OpenMP build would be
somewhat faster, but anyhow these speeds are not bad for a CPU.
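
Since the Intel instance has two chips and the AMD instance has one, it can also be useful to normalize the "-fork=4" totals per chip. This is a rough sketch using the 1:00 status lines from both runs. It ignores differences in thread count, clock rate and TDP, so treat it as a back-of-the-envelope comparison only.

```python
# Per-process rates in Kc/s at 0:00:01:00, copied from the -fork=4 runs above.
intel_nodes = [238262, 237611, 239410, 238409]  # 4 x 12 threads, AVX-512
amd_nodes = [142449, 140353, 142009, 142095]    # 4 x 24 threads, AVX2

chips = {"Intel": 2, "AMD": 1}  # 2x Xeon Platinum 8275CL vs. 1x EPYC 7R32
totals = {"Intel": sum(intel_nodes), "AMD": sum(amd_nodes)}
per_chip = {name: totals[name] / chips[name] for name in totals}

for name in totals:
    print(f"{name}: {totals[name] / 1000:.0f}M c/s total, "
          f"{per_chip[name] / 1000:.0f}M c/s per chip")
```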

Alexander