john-dev - Re: ZedBoard: bcrypt

Message-ID: <20131030174241.GA29473@openwall.com>
Date: Wed, 30 Oct 2013 21:42:41 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: ZedBoard: bcrypt

On Wed, Oct 30, 2013 at 05:31:48PM +0100, Katja Malvoni wrote:
> If I use two ports of the BRAM then I can get only two words out and I need
> 4.

Sure.

> So I store each S-box in two BRAM blocks in order to have all 4 values
> after 2 cycles of delay.

It is unclear what you mean by "each S-box" above.  Blowfish uses four
S-boxes.  Do you mean that you're storing each of the four twice,
for a total of 8 S-boxes stored (two copies of each)?  If so, I see no
reason for you to be doing that.  In order to have all 4 values after
just one lookup's latency, you simply need to store two S-boxes in one
BRAM block and two more in another, with no data duplication.  Is this
what you're actually doing?  If so, this makes sense to me, but your
wording above (and in a previous message) is inconsistent with it.
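
To illustrate the layout described above, here is a minimal software
sketch in Python (not HDL).  It assumes the standard Blowfish F function
and models each dual-port BRAM as a 512-word list serving two reads at
once.  Which S-boxes share a BRAM, and all names below (pack_sboxes,
f_addresses, blowfish_f), are assumptions made for illustration only, not
the actual design.

```python
MASK32 = 0xFFFFFFFF

def pack_sboxes(s0, s1, s2, s3):
    """Pack four 256-word S-boxes into two 512-word memories.

    Assumed pairing: S0 and S1 share one BRAM, S2 and S3 share the other.
    Nothing is stored twice.
    """
    bram_a = list(s0) + list(s1)
    bram_b = list(s2) + list(s3)
    return bram_a, bram_b

def f_addresses(x):
    """Return the two read addresses for each BRAM for 32-bit input x."""
    a = (x >> 24) & 0xFF
    b = (x >> 16) & 0xFF
    c = (x >> 8) & 0xFF
    d = x & 0xFF
    # The second S-box in each BRAM starts at word offset 256.
    return (a, 256 + b), (c, 256 + d)

def blowfish_f(bram_a, bram_b, x):
    """Blowfish F using one read per port: two ports on each of two BRAMs."""
    (a0, a1), (b0, b1) = f_addresses(x)
    s0v, s1v = bram_a[a0], bram_a[a1]  # port A and port B of the first BRAM
    s2v, s3v = bram_b[b0], bram_b[b1]  # port A and port B of the second BRAM
    return ((((s0v + s1v) & MASK32) ^ s2v) + s3v) & MASK32
```

With this layout all four lookups are issued in the same cycle, so all
four values arrive after a single lookup's latency.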

> Since LUT utilization is a problem, wasting some
> BRAM to get faster seemed okay.

Sure.

In fact, as I explained in another message earlier today, you probably
don't even have to waste BRAMs with this approach, if you interleave two
instances of bcrypt per core.  That said, it is in fact fine to continue
to waste BRAMs if LUTs remain the scarce resource.

> > Also, weren't you already at 3 cycles per Blowfish round before this
> > change?  If not, then how many cycles per Blowfish round did you have in
> > the revision that achieved ~300 c/s using 14 cores at 100 MHz?
> 
> No, it was 5 cycles -
> Cycle 0: initiate 2 S-box lookups
> Cycle 1: wait
> Cycle 2: initiate other 2 S-box lookups, compute tmp
> Cycle 3: wait
> Cycle 4: compute new L, swap L and R

Did anything prevent you from initiating the second two lookups shown
at Cycle 2 above, on Cycle 1 instead?  So you'd have:

Cycle 0: initiate 2 S-box lookups
Cycle 1: initiate other 2 S-box lookups, compute tmp
Cycle 2: read first two lookup results from BRAM's outputs
Cycle 3: compute new L, swap L and R

or:

Cycle 0: initiate 2 S-box lookups
Cycle 1: initiate other 2 S-box lookups, compute tmp
Cycle 2: start computation of new L using first two lookup results
Cycle 3: compute new L, swap L and R

Of course, I am not suggesting that we go from 3 cycles to 4.  This is
just a mental exercise for our better understanding.
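
The cycle counts in the schedules above can be checked with a small
Python model.  It assumes a BRAM read latency of 2 cycles (inferred from
the quoted 5-cycle schedule: initiate on Cycle 0, use on Cycle 2), that a
port can accept a new read every cycle, and that the final "compute new L"
step takes place in the cycle in which the last lookup result arrives.
The function name and its parameters are hypothetical.

```python
def cycles_per_round(issue_cycles, read_latency=2):
    """Cycles per Blowfish round for a given lookup issue schedule.

    issue_cycles: the cycle numbers on which S-box lookups are initiated.
    read_latency: cycles from initiating a lookup to its data being usable
                  (assumed to be 2, as in the quoted schedule).
    """
    last_data_cycle = max(issue_cycles) + read_latency
    # The round finishes in the cycle where the last result is consumed,
    # and cycles are numbered from 0, hence the +1.
    return last_data_cycle + 1

original = cycles_per_round([0, 2])     # second pair issued after a wait
overlapped = cycles_per_round([0, 1])   # second pair issued on the next cycle
all_at_once = cycles_per_round([0, 0])  # two BRAMs, all four lookups together
```

Under these assumptions the three schedules come out at 5, 4, and 3
cycles per round respectively.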

> > Yet another thing I found confusing is you mentioning some high offsets
> > into a (logical) 64 KB BRAM.  Is this some extra BRAM you're using for
> > host to FPGA transfer only?
> 
> Yes, there is a 64K shared BRAM used for host transfers (I used 64K because I
> thought more than 14 cores would fit). The host puts data for all cores into
> that BRAM using DMA. Then the arbiter controls the FSMs in the cores so that
> only one core fetches data from the shared BRAM at a time. After all cores
> transfer data from the shared BRAM into local BRAMs, computation is started.
> Then the arbiter again makes sure that cores store data one by one. All this
> time, the host waits in a polling loop checking a software-accessible
> register. This register is set to FF after all cores store data. At the end,
> a DMA transfer from BRAM to host is done.

OK, this is clear.  We could improve upon this approach, but maybe we
don't need to, if we have BRAM blocks to waste anyway.  A concern,
though, is how many slices we're spending on the logic initializing the
per-core BRAMs, and whether that can be optimized.  We may look into
that a bit later.
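
For reference, the transfer sequence described in the quoted text can be
modeled sequentially in Python as follows.  This is only a sketch of the
ordering of steps, assuming one data item per core.  The names (run_batch,
DONE, compute) are hypothetical, and real DMA, arbitration, and
parallelism are not modeled.

```python
DONE = 0xFF  # register value once all cores have stored their results

def run_batch(inputs, compute):
    """Model one batch: host -> shared BRAM -> local BRAMs -> shared -> host."""
    shared = list(inputs)        # stands in for the host's DMA into shared BRAM
    status = 0x00                # software-accessible register
    local = [None] * len(shared)

    for core in range(len(shared)):    # arbiter: one core fetches at a time
        local[core] = shared[core]

    local = [compute(x) for x in local]  # all cores compute (parallel in hardware)

    for core in range(len(shared)):    # arbiter: one core stores at a time
        shared[core] = local[core]
    status = DONE

    while status != DONE:        # host polling loop (trivial in this model)
        pass
    return list(shared)          # stands in for the DMA back to the host
```

For example, run_batch([1, 2, 3], lambda x: x * 2) returns [2, 4, 6].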

Alexander