Date: Sun, 21 Aug 2011 20:33:56 +0400
From: Solar Designer <solar@...nwall.com>
To: crypt-dev@...ts.openwall.com
Subject: optimizing Eksblowfish in FPGA

Yuri, all -

I initially wrote some of the below in an off-list message to Yuri, but
I feel that it's of more use when posted to the list.  Yuri - I'll start
by repeating some of what I already sent to you off-list, then add to
it - so please scroll down to see the new stuff.

Achieving the desired performance for Eksblowfish will probably require
all of:

- Two interleaved Eksblowfish instances per core, for a throughput of
two Blowfish rounds per two clock cycles (effectively one round per
cycle) - see the quick sketch right after this list.
- Lots of cores per Virtex-6 chip (on the order of 100).
- Clock rate of 150 MHz or so (preferably 200+ MHz, maybe using
BlockRAMs' buffer registers, which you're not using yet).
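
Here's a minimal model of that two-way interleave as a quick Python
sketch (illustration only, not HDL; the instance numbering is made up):
on each cycle, one instance issues its memory reads while the other
processes the data it requested on the previous cycle, so each instance
sees a 1-cycle read latency, yet the core retires one round per cycle:

INSTANCES = 2

for cycle in range(4):
    req = cycle % INSTANCES + 1          # instance issuing memory reads
    use = (cycle - 1) % INSTANCES + 1    # instance consuming prior reads
    print("cycle %d: request reads for #%d, process data for #%d"
          % (cycle, req, use))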

Comparing this to a quad-core CPU running at 3.0 GHz and doing one
Blowfish round in 5 clock cycles (per core), we get:

Speedup from fewer cycles per round: 5x (5 cycles down to 1)
Speedup from more cores: 25x (100 cores vs. 4)
Slowdown from lower clock rate: 20x (3.0 GHz vs. 150 MHz)

So this gives us a reasonable 6.25x overall speedup (5 * 25 / 20).  But
I understand it'd be tough to achieve.  Also, it assumes zero
communication delays, whereas in practice it'd be tough to keep those
delays at an acceptable level (I guess you'd need to overlap data
transfers with actual operation of the cores - e.g., core #1 should
start processing as soon as its data is in the FPGA, and it should be
the first one queried for results).
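
To make the arithmetic explicit, here's the same estimate as a trivial
Python calculation (all numbers are the assumptions stated above, not
measurements):

cpu_cores, cpu_mhz, cpu_cycles_per_round = 4, 3000.0, 5
fpga_cores, fpga_mhz, fpga_cycles_per_round = 100, 150.0, 1

# Blowfish rounds per microsecond on each platform
cpu_rate = cpu_cores * cpu_mhz / cpu_cycles_per_round      # 2400
fpga_rate = fpga_cores * fpga_mhz / fpga_cycles_per_round  # 15000

print("speedup: %.2fx" % (fpga_rate / cpu_rate))           # 6.25x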

New stuff (not yet discussed via private e-mail):

To make use of the buffer registers, you'd perhaps need to interleave
three instances per core rather than just two.  It'd be something like
this:

Cycle N:
request memory reads for instance #1
process previously requested data for instance #2

Cycle N+1:
request memory reads for instance #2
process previously requested data for instance #3

Cycle N+2:
request memory reads for instance #3
process previously requested data for instance #1

So you have a 2-cycle latency on memory reads, which likely allows for
a higher clock rate.  A drawback, though, may be that a smaller
percentage of LUTs is in use on any given clock cycle.  So this may or
may not be an overall improvement - you'd need to experiment with it.
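
Here's the same rotation in the Python model from above, now three deep
(again just a hypothetical sketch, not HDL): an instance's data is
consumed two cycles after it was requested, matching the extra BlockRAM
output register, while the core as a whole still retires one round per
cycle:

INSTANCES = 3          # interleave depth = read latency + 1

for cycle in range(6):
    req = cycle % INSTANCES + 1        # instance issuing memory reads
    use = (cycle - 2) % INSTANCES + 1  # instance consuming 2-cycle-old reads
    print("cycle %d: request reads for #%d, process data for #%d"
          % (cycle, req, use))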

Then, to save on control logic, you may use even more instances per
core, but running them in parallel rather than interleaved.  So in the
example above, you'd have something like this on a given clock cycle,
all in the same core (same state machine):

request memory reads for instance #1
request memory reads for instance #4
request memory reads for instance #7
process previously requested data for instance #2
process previously requested data for instance #5
process previously requested data for instance #8
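
In the Python model, that's the same schedule with several instances
bound to each interleave slot (hypothetical sketch; the numbering
follows the example above) - one state machine indexes the slot, and
the three instances in it advance in lockstep:

SLOTS = 3          # interleave depth, as before
GROUPS = 3         # parallel instances per slot; 9 instances per core

for cycle in range(3):
    req_slot = cycle % SLOTS
    use_slot = (cycle - 2) % SLOTS
    req = [req_slot + 1 + SLOTS * g for g in range(GROUPS)]
    use = [use_slot + 1 + SLOTS * g for g in range(GROUPS)]
    print("cycle %d: request reads for %s, process data for %s"
          % (cycle, req, use))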

These are just some thoughts that I'd like to have recorded for when
it's the right time to optimize things.

Alexander
