john-dev - Re: optimizing bcrypt cracking on x86

Follow @Openwall on Twitter for new release announcements and other news

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20150624064347.GA27412@openwall.com>
Date: Wed, 24 Jun 2015 09:43:47 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: optimizing bcrypt cracking on x86

On Wed, Jun 24, 2015 at 07:10:07AM +0300, Solar Designer wrote:
> Now, the bottleneck does indeed appear to be different.  It appears to
> be taking too many instructions and uops to extract all of the S-box
> indices and compute the effective addresses to keep the L1 data cache
> read ports 100% busy at all times.  (We only keep them about 60% busy.)

BTW, optimizing the effective address calculation could make a
difference.  In my testing on Haswell, not of bcrypt code but in
general, there is performance impact from using larger than 1-byte
displacement, as well as from using an index register (even if the
scaling factor is set to 1).  Loads that use only a base register with
either no displacement or a 1-byte displacement (sufficient e.g. to skip
P-boxes and get to the first S-box) run faster (or rather, have less
impact on issue rate of adjacent instructions).

To avoid needing index scaling, we could use 64-bit S-box elements,
using bits 33 to 2 for data.  Extracting the lowest pre-scaled index is
then quick.  Extracting the remaining 3, not so much - can't just use
BEXTR, need an AND.  Maybe the cost of an AND can be amortized, if
applied to two non-adjacent pre-scaled indices at once, and then BEXTR
could be used.  BMI's 3-op ANDN instruction may be used to avoid MOVs.

Unfortunately, when using only a base register, we can't avoid either
needing large displacements or ADDing them to that register first.  Yet
in my testing performance impact from using both a large displacement
and an index at once is worse than from using either of these alone.
So maybe large_displacement+base is better than small+base+index*4.
... Just tested on artificial code: yes, this is so, especially when the
instruction is a load-op rather than a pure load,

Unfortunately, there's also performance impact from larger than 1-byte
immediate values in instructions such as AND.  So we'd need to minimize
the number of different masks and preload them into registers.  This
means that this approach is possibly only worthwhile on CPUs with HT,
where we don't need more than 2x interleaving.

Alexander

Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.