Date: Tue, 10 Sep 2013 01:12:49 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: ZedBoard / Parallella: bcrypt

Katja,

On Mon, Sep 09, 2013 at 09:48:33PM +0200, Katja Malvoni wrote:
> I found a bug in the Verilog code and fixed it. Now it produces the
> correct result for all test vectors.
> 
> At the moment I'm using usleep(78125) instead of a done signal

Thanks for these status updates!

So that's 12.8 c/s per bcrypt core for now.  Given the current FPGA
utilization with and without this one bcrypt core added, roughly how
many such bcrypt cores would fit?  What clock rate are you using?
What's the maximum clock rate for this design on this device, as
reported by ISE?  Can you easily increase the clock rate to be closer to
that maximum?
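
To spell out where the 12.8 c/s figure comes from, here is a minimal
sketch of the arithmetic.  It assumes that the host computes one hash
per usleep(78125) wait and that the wait dominates everything else; the
function name is made up for illustration.

```python
def hashes_per_second(wait_us: int) -> float:
    """Upper bound on c/s for one core when the host sleeps for
    wait_us microseconds per hash instead of polling a done signal."""
    return 1_000_000 / wait_us


# usleep(78125) is the fixed wait quoted above
print(hashes_per_second(78125))
```

Once a done signal replaces the fixed sleep, the real per-hash time
would take the place of wait_us here.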

I understand that these questions may be premature, yet I'd like us to
establish a performance baseline before proceeding with optimizations.

12.8 c/s roughly corresponds to the original Pentium at 75 MHz.  Since
your clock rate is probably comparable to that, it's a fine speed to
have for totally unoptimized code.  We just need to make it a few times
higher with optimizations, and use many cores.

Theoretically, going from block RAM count, we can have up to 140 bcrypt
cores in a Zynq 7020.  If other FPGA resources permit us to have this
many cores (or at least more than 70), we need to keep four S-boxes per
block RAM, which means that with two ports we'll be able to do two S-box
lookups per core per cycle.  Ideally, we'd have the data available on the
next cycle, but even if the latency is somehow 2 cycles (for example,
because we chose to enable registers in order to achieve a higher clock
rate), we can nevertheless proceed with the next two lookups right away
half of the time.  This is because at the start of a Blowfish round we
have the indices for all four S-boxes, so the second pair of lookups can
be issued without waiting for the first, but we have to wait for all four
lookup results to be available before we can proceed with the next round.
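
The reasoning above can be sketched as a rough cycle-count model.  This
is only an illustration under stated assumptions: reads are pipelined
(a new pair of lookups can be issued every cycle), the adds and XORs
that combine the four results are not counted, and the function names
are hypothetical.

```python
from math import ceil

LOOKUPS_PER_ROUND = 4    # one lookup in each of the four Blowfish S-boxes
SBOX_ENTRIES = 256       # entries per S-box
ENTRY_BITS = 32          # bits per S-box entry


def sbox_storage_bits() -> int:
    """Bits needed to hold all four S-boxes of one bcrypt instance."""
    return LOOKUPS_PER_ROUND * SBOX_ENTRIES * ENTRY_BITS


def lookup_cycles_per_round(ports: int, latency: int) -> int:
    """Cycles from issuing the first lookup of a round until the last
    result arrives, assuming pipelined reads (simplified model)."""
    issue_cycles = ceil(LOOKUPS_PER_ROUND / ports)
    # The last batch is issued in cycle issue_cycles - 1 and its data
    # arrives latency cycles later.
    return issue_cycles - 1 + latency


print(sbox_storage_bits())  # 32768 bits, i.e. 1024 32-bit words
for latency in (1, 2):
    print(latency, lookup_cycles_per_round(ports=2, latency=latency))
```

In this model, two ports give 2 cycles of lookups per round at 1-cycle
latency and 3 cycles at 2-cycle latency, rather than 4, because the
second pair of lookups overlaps with the first pair's latency.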

On the other hand, if other resource utilization is such that we can't
reasonably hope to achieve 70+ cores, then we may use 2+ separate
block RAMs per core in order to increase the port count, and thus the
number of S-box lookups made at once.
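
To make the trade-off concrete, here is a small sketch that tabulates
core count against port count.  It assumes block RAM is the only
limiting resource (which is exactly what we don't know yet) and uses the
140 block RAM figure from above; the names are made up for illustration.

```python
from math import ceil

TOTAL_BRAMS = 140        # block RAM count cited above for the Zynq 7020
PORTS_PER_BRAM = 2       # two ports per block RAM
LOOKUPS_PER_ROUND = 4    # four S-box lookups per Blowfish round


def tradeoff(brams_per_core: int) -> tuple[int, int, int]:
    """Return (max cores, ports per core, cycles needed to issue all
    four lookups of a round), with block RAM as the only constraint."""
    cores = TOTAL_BRAMS // brams_per_core
    ports = PORTS_PER_BRAM * brams_per_core
    issue_cycles = ceil(LOOKUPS_PER_ROUND / ports)
    return cores, ports, issue_cycles


for k in (1, 2):
    print(k, tradeoff(k))
```

With one block RAM per core this gives up to 140 cores issuing two
lookups per cycle; with two per core, up to 70 cores issuing all four
lookups at once.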

Another aspect is the number of bcrypt instances per core, which is both
an input to and an outcome of our decision-making on better use of block
RAMs and on the number of cores.

Alexander
