john-dev - Re: ZedBoard: bcrypt

Message-ID: <20131103220247.GA25424@openwall.com>
Date: Mon, 4 Nov 2013 02:02:47 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: ZedBoard: bcrypt

Hi Katja,

On Sun, Nov 03, 2013 at 02:06:27PM +0100, Katja Malvoni wrote:
> On Wed, Oct 30, 2013 at 10:17 AM, Solar Designer <solar@...nwall.com> wrote:
> 
> > If so, does anything prevent you from optimizing this to? -
> >
> > Cycle 0: compute new R; swap L and R; initiate 4 S-box lookups
> > Cycle 1: wait
> 
> I implemented this -

Great!  I think your next step is to implement two instances of bcrypt
per core, so that there are no wait-only cycles.  That is, in Cycle 1
above you would be doing the same kind of work as on Cycle 0, but for
the other instance.  You may use the currently wasted halves of the same
RAM blocks (just set the most significant address bit when doing the
memory accesses for the second bcrypt instance) or you may use separate
RAM blocks - whichever results in lower utilization of other resources.
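
A minimal sketch of the interleaving idea, as a Python model rather than
HDL.  The address width ADDR_BITS and both helper names are hypothetical,
chosen only for illustration; the real per-instance footprint depends on
how the S-boxes are laid out in the RAM blocks.

```python
# Hypothetical: each bcrypt instance occupies 2**ADDR_BITS words of a RAM
# block, and the block is twice that deep (the "wasted half").
ADDR_BITS = 9


def ram_address(instance, offset):
    """Select a half of the RAM block via the most significant address bit."""
    assert instance in (0, 1)
    assert 0 <= offset < (1 << ADDR_BITS)
    return (instance << ADDR_BITS) | offset


def utilization(num_instances, total_cycles=1000):
    """Fraction of cycles in which some instance does real work.

    Each instance repeats a 2-cycle pattern (compute, then wait for the
    S-box lookups); instance k is shifted by k cycles.
    """
    busy = 0
    for cycle in range(total_cycles):
        if any((cycle - k) % 2 == 0 for k in range(num_instances)):
            busy += 1
    return busy / total_cycles


print(ram_address(0, 5), ram_address(1, 5))
print(utilization(1), utilization(2))
```

In this model one instance keeps the core busy half of the time, and two
instances offset by one cycle fill every cycle.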

> performance on self test for one core is 79 c/s while
> for 14 cores it's 765 c/s. For cost 12 these numbers are 0.6656c/s for 1
> core and 8.002c/s for 14 cores. Overhead of loading data from shared BRAM
> into per core BRAMs is significant.
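
To put numbers on that scaling gap, here is a quick calculation that uses
only the figures quoted above.  The helper name is made up for
illustration.

```python
def scaling_efficiency(one_core_cps, n_core_cps, cores):
    """Measured multi-core speed relative to ideal linear scaling."""
    return n_core_cps / (one_core_cps * cores)


# Figures as reported in the quoted message, 14 cores in both cases.
self_test = scaling_efficiency(79, 765, 14)
cost_12 = scaling_efficiency(0.6656, 8.002, 14)

print(f"self test: {self_test:.1%} of linear scaling")
print(f"cost 12:   {cost_12:.1%} of linear scaling")
```

This comes out at roughly 69% of linear scaling for the self test and
roughly 86% for cost 12, which is consistent with a fixed per-hash
overhead that matters less as the variable-cost loop gets longer.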

I think it's not only the overhead of loading data, but also the
overhead of host-side computation, which is not currently overlapped
with computation on the FPGA.  Remember that you only implemented
bcrypt's variable-cost loop on the FPGA, keeping some fixed-cost
Blowfish stuff before and after this loop on the host CPU.  Although
JtR's format interface currently requires that everything is in sync by
the time crypt_all() returns (no precomputation for the next set of
candidate passwords is possible at this point), you may nevertheless
overlap host and FPGA computation most of the time by making
max_keys_per_crypt several times higher and overlapping things inside of
crypt_all(), except for the very last subset of candidate passwords.
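
A rough sketch of that overlap, written in Python for brevity rather than
as JtR format code.  host_setup(), fpga_loop() and host_finish() are
hypothetical stand-ins for the fixed-cost Blowfish work on the host and
the variable-cost loop on the FPGA, and the single worker thread merely
plays the role of the FPGA.

```python
from concurrent.futures import ThreadPoolExecutor


def host_setup(key):
    """Stand-in for the fixed-cost host work before the loop."""
    return ("state", key)


def fpga_loop(states):
    """Stand-in for the variable-cost loop offloaded to the FPGA."""
    return [("looped", s) for s in states]


def host_finish(result):
    """Stand-in for the fixed-cost host work after the loop."""
    return ("hash", result)


def crypt_all(keys, subset_size):
    subsets = [keys[i:i + subset_size]
               for i in range(0, len(keys), subset_size)]
    if not subsets:
        return []
    hashes = []
    previous = None
    with ThreadPoolExecutor(max_workers=1) as fpga:
        # Setup for the first subset has nothing to overlap with.
        states = [host_setup(k) for k in subsets[0]]
        for i in range(len(subsets)):
            pending = fpga.submit(fpga_loop, states)
            # While the "FPGA" is busy, finish the previous subset
            # and prepare the next one on the host.
            if previous is not None:
                hashes.extend(host_finish(r) for r in previous)
            if i + 1 < len(subsets):
                states = [host_setup(k) for k in subsets[i + 1]]
            previous = pending.result()
    # Finishing the last subset cannot be overlapped either.
    hashes.extend(host_finish(r) for r in previous)
    return hashes  # everything is in sync by the time we return
```

Calling crypt_all(list(range(112)), 28) processes four subsets of 28
keys and returns the results in the original key order.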

I suggest that you make this max_keys_per_crypt increase factor
configurable - at least at compile-time, or it can even be chosen at
runtime since the format's init() may modify max_keys_per_crypt.

For example, with 14 cores and two bcrypt instances per core, you'd have
min_keys_per_crypt at 28, but you may have max_keys_per_crypt at higher
multiples of 28 - e.g., 112.  With that, you'd be able to overlap host
and FPGA computation 3/4 of the time.
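
The arithmetic behind that figure, as a small sketch.  It assumes a
simplified model in which exactly one subset out of each crypt_all() call
cannot be overlapped; MAX_KEYS_FACTOR and overlap_fraction() are
illustrative names, not existing JtR identifiers.

```python
from fractions import Fraction

CORES = 14
INSTANCES_PER_CORE = 2
MAX_KEYS_FACTOR = 4  # hypothetical tunable: subsets per crypt_all() call

min_keys_per_crypt = CORES * INSTANCES_PER_CORE
max_keys_per_crypt = min_keys_per_crypt * MAX_KEYS_FACTOR


def overlap_fraction(min_keys, max_keys):
    """Share of subsets whose host work can overlap with FPGA work."""
    subsets = max_keys // min_keys
    return Fraction(subsets - 1, subsets)


print(min_keys_per_crypt, max_keys_per_crypt)
print(overlap_fraction(min_keys_per_crypt, max_keys_per_crypt))
```

Under this model a factor of 4 gives 3/4, a factor of 8 would give 7/8,
and a factor of 1 gives no overlap at all.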

Is the above explanation clear?  Please feel free to ask any questions
you might have.

> Maximum frequency is now 93.765 MHz although design seems to be working
> properly with 100 MHz clock.

OK.  I think you might be able to optimize this and the LUT utilization
later.  For now, please focus on getting a second instance of bcrypt per
core.  I hope you'll be able to keep the core count at 14 or, if you
have to, very slightly lower.

In fact, you might be able to achieve better overall results with even
more than two bcrypt instances per core - try that after you get two
instances working.

Alexander