john-dev - Re: optimizing bcrypt cracking on x86


Message-ID: <20150701100505.GA9071@openwall.com>
Date: Wed, 1 Jul 2015 13:05:05 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: optimizing bcrypt cracking on x86

On Thu, Jun 25, 2015 at 07:33:21AM +0300, Solar Designer wrote:
> Regarding the 2x2 MMX2 code on i7-4770K:
> 
> On Wed, Jun 24, 2015 at 07:10:07AM +0300, Solar Designer wrote:
> > On 64-bit builds, though, I only got this to run at cumulative speeds
> > like 780*8 = 6240 c/s, which is worse than 6595 c/s previously seen with
> > OpenMP (and even worse than the slightly better speeds that can be seen
> > with separate independent processes).
> 
> I managed to improve this to 796*8 = 6368 c/s by removing some of the
> large displacements on loads, and instead keeping them in base registers
> (using the extra GPRs that we have in 64-bit mode for this).  For the
> 288 bytes of P, an offset into the middle of this range may be put into
> a register, and then 256 out of the 288 bytes may be accessed via 1-byte
> displacements (or alternatively 248 out of 288, but then we can also
> access the first S-box via the same base register with 0x78 in the
> 1-byte displacement).  Also, remembering that R13 is special just like
> RBP (no without-displacement encoding) can sometimes be helpful.
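
To make the quoted displacement arithmetic concrete, here is a small C
sketch (not the actual assembly).  It assumes the first S-box starts
immediately after the 288 bytes of P, which is the layout under which
the 248 and 0x78 figures work out.  The function names are made up for
illustration only.

```c
#include <assert.h>

#define P_BYTES 288 /* size of the P area quoted in the message */

/* An x86 1-byte displacement is a signed 8-bit value. */
int fits_disp8(long disp)
{
	return disp >= -128 && disp <= 127;
}

/*
 * Count the byte offsets within P that can be addressed with a 1-byte
 * displacement when the base register holds P + base_off.
 */
int p_bytes_reachable(long base_off)
{
	int n = 0;
	long off;

	assert(base_off >= 0 && base_off <= P_BYTES);
	for (off = 0; off < P_BYTES; off++)
		n += fits_disp8(off - base_off);
	return n;
}

/*
 * Displacement of the first S-box from the same base register,
 * ASSUMING the S-box is placed right after P.
 */
long sbox_disp(long base_off)
{
	return P_BYTES - base_off;
}
```

Under these assumptions, p_bytes_reachable(128) gives 256, whereas a
base of P + 288 - 0x78 = P + 168 gives 248 and puts the start of the
S-box at displacement 0x78, which still fits in 1 byte.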

Another related trick, which I haven't tried yet, is to interleave pairs
of S-boxes (from the same bcrypt instance or from different instances).
Then the same base register could be used to access two such S-boxes
at once, with "4" in the displacement field (fits in 1 byte, obviously)
for the second S-box in a pair.  The index scaling would be by 8 rather
than by 4, but it's the same cost.  This way, only 4 base registers
would be needed to access everything for 2 bcrypt instances, with at
most 1-byte displacements (thus avoiding 4-byte displacements, which
appear to cost extra on Haswell).
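
A minimal C sketch of such an interleaved layout, for illustration
only (the real code would be assembly, and none of this has been
tried).  The type and function names are hypothetical.  The lookup
spells out the address as base + index*8 + displacement, mirroring an
x86 operand like [base + idx*8 + 4].

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SBOX_ENTRIES 256

/*
 * Two S-boxes interleaved word by word: entry i of the first S-box is
 * at byte offset i*8, entry i of the second one at byte offset i*8 + 4.
 */
typedef struct {
	uint32_t word[SBOX_ENTRIES][2];
} sbox_pair;

/* Build the interleaved pair from two ordinary 256-entry S-boxes. */
void sbox_pair_fill(sbox_pair *dst, const uint32_t *s0, const uint32_t *s1)
{
	int i;

	for (i = 0; i < SBOX_ENTRIES; i++) {
		dst->word[i][0] = s0[i];
		dst->word[i][1] = s1[i];
	}
}

/*
 * Look up entry idx in S-box "which" (0 or 1) of the pair.  The index
 * is scaled by 8, and the displacement is 0 or 4.
 */
uint32_t sbox_pair_lookup(const sbox_pair *p, uint8_t idx, int which)
{
	const unsigned char *base = (const unsigned char *)p;
	uint32_t v;

	assert(which == 0 || which == 1);
	memcpy(&v, base + (size_t)idx * 8 + (size_t)which * 4, sizeof(v));
	return v;
}
```

With 4 S-boxes per bcrypt instance, 2 instances have 8 S-boxes, which
is 4 such pairs and thus 4 base registers.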

Another advantage of such interleaving is that we're guaranteed to have
no cache bank conflict between lookups from these two S-boxes then.  Per
my testing, this is irrelevant for Haswell, but it might be relevant on
other CPUs (and not only CPUs).
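
The no-conflict property can be checked with a toy model in C.  This is
a simplified sketch and not a description of any particular CPU: it
ASSUMES 16 banks, each 4 bytes wide, selected by address bits 2 to 5,
and it treats any two accesses to the same bank as a conflict.  All
names and constants here are assumptions for illustration.

```c
#include <stdint.h>

#define BANK_WIDTH 4	/* ASSUMED bank width in bytes */
#define NUM_BANKS 16	/* ASSUMED number of banks */
#define SBOX_ENTRIES 256

/* Bank selected by a byte offset under the assumed model. */
unsigned bank_of(uint32_t byte_off)
{
	return (byte_off / BANK_WIDTH) % NUM_BANKS;
}

/*
 * Two separate 1024-byte S-boxes placed back to back: count the index
 * pairs (i, j) for which the two lookups hit the same bank.
 */
int separate_conflicts(void)
{
	int i, j, n = 0;

	for (i = 0; i < SBOX_ENTRIES; i++)
		for (j = 0; j < SBOX_ENTRIES; j++)
			n += bank_of(4 * i) == bank_of(1024 + 4 * j);
	return n;
}

/*
 * Interleaved pair: the first S-box uses offsets 8*i, the second one
 * 8*j + 4, so they always select even and odd banks, respectively.
 */
int interleaved_conflicts(void)
{
	int i, j, n = 0;

	for (i = 0; i < SBOX_ENTRIES; i++)
		for (j = 0; j < SBOX_ENTRIES; j++)
			n += bank_of(8 * i) == bank_of(8 * j + 4);
	return n;
}
```

In this model, the interleaved layout has no conflicting index pairs at
all, whereas the separate layout has them whenever the two indices are
equal modulo 16.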

> This is still not good enough, though.

Alexander
