Date: Wed, 30 Oct 2013 22:07:31 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: ZedBoard: bcrypt

On Wed, Oct 30, 2013 at 06:55:10PM +0100, Katja Malvoni wrote:
> On Wed, Oct 30, 2013 at 6:42 PM, Solar Designer <solar@...nwall.com> wrote:
> 
> > > So I store each S-box in two BRAM blocks in order to have all 4 values
> > > after 2 cycles of delay.
> >
> > It is unclear what you mean by "each S-box" above.  Blowfish uses four
> > S-boxes.  Do you want to say that you're storing each of the four twice,
> > for a total of 8 S-boxes stored (two dupes of each)?  If so, I see no
> > reason for you to be doing that.  In order to have all 4 values after
> > just one lookup's latency, you simply need to store two S-boxes in one
> > BRAM block and two more in another, with no data duplication.  Is this
> > what you're actually doing?  If so, this makes sense to me, but your
> > wording above (and in a previous message) is inconsistent with it.
> 
> I was using the wrong wording - I was calling S[4][0x100] one S-box. In
> correct wording: I am storing 4 S-boxes in one BRAM and then the same 4
> S-boxes again in another BRAM, which is a total of 8 S-boxes. I'll change
> this to 2 S-boxes in one BRAM and two in another one.

OK.
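
For reference, and purely as an illustration (the sketch below is mine,
not taken from the JtR or ZedBoard sources), the reason all four values
per lookup matter is the Blowfish F function, which reads one word from
each of the four S-boxes in every round:

#include <stdint.h>

/* S[4][0x100]: four S-boxes of 256 32-bit words each */
static uint32_t S[4][0x100];

/*
 * One Blowfish F evaluation: four independent S-box reads per round.
 * In the mapping discussed above, S[0]/S[1] can share one dual-port
 * BRAM and S[2]/S[3] another, so all four reads complete after a
 * single lookup's latency with no data duplication.
 */
static uint32_t F(uint32_t x)
{
        uint32_t a = S[0][(x >> 24) & 0xff];
        uint32_t b = S[1][(x >> 16) & 0xff];
        uint32_t c = S[2][(x >> 8) & 0xff];
        uint32_t d = S[3][x & 0xff];

        return ((a + b) ^ c) + d;
}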

Regarding data transfers:
> > OK, this is clear.  We could improve upon this approach, but maybe we
> > don't need to, if we have BRAM blocks to waste anyway.  A concern,
> > though, is how many slices we're spending on the logic initializing the
> > per-core BRAMs, and whether that can be optimized.  We may look into
> > that a bit later.
> 
> With only one core, utilization is:
> Register: 5%
> LUT: 15%
> Slice: 25%
> RAMB36E1: 6%
> RAMB18E1: 1%
> BUFG: 3%

Thanks for this utilization data.  Note that there's probably quite some
per-core overhead, including in Slice utilization, for initializing the
per-core BRAMs from the BRAM that you use for data transfers from the
host.  You probably have muxes in each core, since you're already using
all of the per-core BRAMs' ports for computation.  ... or are there
write ports separate from the read ports that you use for computation?

> Two AXI buses and DMA take away some space (I think around 20% of Slice
> utilization). I'll try to think about other possible ways of host-FPGA
> communication.

OK, but:

I am not concerned about the 25% Slice utilization above as much as I am
about how much of the remaining 75% is possibly consumed by the overhead
needed to initialize the per-core BRAMs.  I wouldn't be surprised if
e.g. one third of the remaining 75% is consumed by such overhead.
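
A rough back-of-the-envelope, taking your ~20% AXI/DMA estimate at face
value (it may well be off):

    25% (one core + infrastructure) - 20% (AXI/DMA)  ~=  5% per core
    75% remaining / 5% per core                      ~=  15 cores at best

and that best case holds only if the per-core BRAM initialization
overhead is already fully contained in that ~5%.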

Alexander
