Date: Wed, 24 Jun 2015 11:56:43 -0400
From: Alain Espinosa <alainesp@...ta.cu>
To: john-dev@...ts.openwall.com
Subject: RE: optimizing bcrypt cracking on x86



-------- Original message --------
From: Solar Designer <solar@...nwall.com> 
Date: 06/24/2015 12:10 AM (GMT-05:00) 
To: john-dev@...ts.openwall.com 
Cc: 
Subject: [john-dev] optimizing bcrypt cracking on x86 

...Alain has recently mentioned similar poor results for his attempts to
use AVX2

One thing worth trying is to interleave scalar instructions with the AVX2 ones. I am not quite sure how the AVX2 gather works, but if it uses one ALU port and one read port, then three ALU ports and one read port are left idle. I will try this in the near future.

...shldl $16,tmp1,tmp2
(Latency 2, throughput 1)

Faster is:
mov tmp2, tmp1
shl tmp2, 16
(Latency 2, throughput 0.75)

Note that on Haswell, shld has lower throughput than on older architectures.

...bextr %r14d,La,tmp1;

I get the BMI speedup by using shrx. I think bextr is similar to shld, and it is faster to use shrx followed by and. Unfortunately, I don't have latency/throughput figures for the BMI instructions. There are none in the "Intel 64 and IA-32 Architectures Optimization Reference Manual", June 2013 version (too old?).
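
To make the shrx-followed-by-and idea concrete, here is a rough C sketch of what the two approaches compute. It is only a software model, not code from JtR: the function names bextr_model and shift_and_mask are made up for illustration, and it assumes the field being extracted is a fixed-width one (for example an 8-bit index), so the mask can be a constant. It says nothing about timings.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical software model of 32-bit bextr: bits 7:0 of the control
   word give the start bit, bits 15:8 give the field length. */
static uint32_t bextr_model(uint32_t src, uint32_t control)
{
    uint32_t start = control & 0xff;
    uint32_t len = (control >> 8) & 0xff;

    if (start >= 32 || len == 0)
        return 0;                       /* nothing left to extract */

    uint32_t shifted = src >> start;
    if (len >= 32)
        return shifted;                 /* field covers all remaining bits */
    return shifted & ((UINT32_C(1) << len) - 1);
}

/* Hypothetical model of the replacement: a variable-count logical right
   shift (as shrx does, with the count taken modulo 32) followed by an
   and with a constant mask. */
static uint32_t shift_and_mask(uint32_t src, unsigned start, uint32_t mask)
{
    return (src >> (start & 31)) & mask;
}

/* Small self-check: both forms agree when extracting each byte of a word. */
static void check_byte_extraction(uint32_t x)
{
    for (unsigned start = 0; start < 32; start += 8) {
        uint32_t control = (UINT32_C(8) << 8) | start;
        assert(bextr_model(x, control) == shift_and_mask(x, start, 0xff));
    }
}
```

For a fixed field width the two forms return the same value, so the choice between them comes down to the latency and throughput of the instructions on the target CPU.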

Regards, 
Alain