Date: Wed, 30 Oct 2013 17:31:48 +0100
From: Katja Malvoni <kmalvoni@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: ZedBoard: bcrypt

On Wed, Oct 30, 2013 at 4:23 PM, Solar Designer <solar@...nwall.com> wrote:

> On Wed, Oct 30, 2013 at 02:07:28PM +0100, Katja Malvoni wrote:
> > On Wed, Oct 30, 2013 at 10:17 AM, Solar Designer <solar@...nwall.com>
> wrote:
> > > On Tue, Oct 29, 2013 at 07:48:35PM +0100, Katja Malvoni wrote:
> > > > At the moment performance is 602 c/s, maximum frequency is 100 MHz.
> > >
> > > What has contributed to doubling the performance (since your previous
> > > report)?  I guess it could be performing the 4 S-box lookups all at
> > > once, but then you're giving high numbers of cycles per round anyway:
> >
> > That is correct, since most of the RAM is unused I'm storing each S-box
> > twice.
>
> Now I am confused.  Why would you need to store each S-box twice in
> order to perform all four lookups at once?  I think it'd be sufficient
> for you to allocate no more than two S-boxes per BRAM (since you have at
> most two ports per BRAM, at least physical), but I don't see any need
> for data duplication.
>

If I use two ports of the BRAM, then I can get only two words out, and I need
4. So I store each S-box in two BRAM blocks in order to have all 4 values
after 2 cycles of delay. Since LUT utilization is the problem, wasting some
BRAM to get faster seemed okay.
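
For reference, the standard Blowfish F function (textbook C, not our HDL)
shows why a round needs one word from each of the four S-boxes, so two BRAM
ports alone can serve only half of the lookups per cycle:

#include <stdint.h>

/* Standard Blowfish F: one lookup into each of the four S-boxes per
 * round.  With only two BRAM ports, just two of these reads can be
 * issued per cycle unless the S-boxes are duplicated in a second BRAM. */
static uint32_t F(const uint32_t S[4][256], uint32_t x)
{
	return ((S[0][(x >> 24) & 0xff] + S[1][(x >> 16) & 0xff])
	        ^ S[2][(x >> 8) & 0xff]) + S[3][x & 0xff];
}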

> Also, weren't you already at 3 cycles per Blowfish round before this
> change?  If not, then how many cycles per Blowfish round did you have in
> the revision that achieved ~300 c/s using 14 cores at 100 MHz?


No, it was 5 cycles -
Cycle 0: initiate 2 S-box lookups
Cycle 1: wait
Cycle 2: initiate other 2 S-box lookups, compute tmp
Cycle 3: wait
Cycle 4: compute new L, swap L and R
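
In C terms, that schedule corresponds roughly to the round below; the
per-cycle split in the comments is my reading of it, not taken from the HDL:

#include <stdint.h>

/* One Blowfish round annotated with the 5-cycle schedule above. */
static void blowfish_round(const uint32_t S[4][256], uint32_t P_i,
                           uint32_t *L, uint32_t *R)
{
	uint32_t x = *L ^ P_i;
	/* cycle 0: initiate the S[0] and S[1] lookups on x         */
	/* cycle 1: wait out the BRAM read latency                  */
	/* cycle 2: initiate the S[2] and S[3] lookups; compute tmp */
	uint32_t tmp = S[0][(x >> 24) & 0xff] + S[1][(x >> 16) & 0xff];
	/* cycle 3: wait out the BRAM read latency                  */
	/* cycle 4: compute new L, swap L and R                     */
	uint32_t newL = *R ^ ((tmp ^ S[2][(x >> 8) & 0xff]) + S[3][x & 0xff]);
	*R = x;
	*L = newL;
}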


> You
> still have 14 cores now, right?
>

Yes, there are still 14 cores.


> Yet another thing I found confusing is you mentioning some high offsets
> into a (logical) 64 KB BRAM.  Is this some extra BRAM you're using for
> host to FPGA transfer only?
>

Yes, there is a 64 KB shared BRAM used for host transfers (I used 64 KB
because I thought more than 14 cores would fit). The host puts data for all
cores into that BRAM using DMA. Then the arbiter controls the FSMs in the
cores so that only one core fetches data from the shared BRAM at a time.
After all cores transfer data from the shared BRAM into their local BRAMs,
computation is started. And then the arbiter again makes sure that the cores
store data back one by one. All this time, the host waits in a polling loop
checking a software-accessible register. This register is set to FF after
all cores have stored their data. At the end, a DMA transfer from the BRAM
to the host is done.
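
In pseudo-C, the host side of that sequence looks roughly like this; the
register name and the DMA helpers are placeholders, not the actual driver
API:

#include <stdint.h>

#define ALL_CORES_DONE 0xFF                      /* value set by the FPGA */

extern volatile uint32_t *status_reg;            /* software-accessible register */
extern void dma_to_fpga(const void *src, uint32_t len);  /* host -> shared BRAM */
extern void dma_from_fpga(void *dst, uint32_t len);      /* shared BRAM -> host */

void run_batch(const void *in, void *out, uint32_t len)
{
	/* 1. Put data for all 14 cores into the shared 64 KB BRAM via DMA. */
	dma_to_fpga(in, len);

	/* 2. Cores copy their slices to local BRAMs one at a time (arbiter),
	 *    compute, and store results back; the host just polls. */
	while (*status_reg != ALL_CORES_DONE)
		;

	/* 3. DMA the results from the shared BRAM back to the host. */
	dma_from_fpga(out, len);
}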

Katja
