Date: Wed, 30 Oct 2013 18:55:10 +0100
From: Katja Malvoni <kmalvoni@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: ZedBoard: bcrypt

On Wed, Oct 30, 2013 at 6:42 PM, Solar Designer <solar@...nwall.com> wrote:

> > So I store each S-box in two BRAM blocks in order to have all 4 values
> > after 2 cycles of delay.
>
> It is unclear what you mean by "each S-box" above.  Blowfish uses four
> S-boxes.  Do you want to say that you're storing each of the four twice,
> for a total of 8 S-boxes stored (two dupes of each)?  If so, I see no
> reason for you to be doing that.  In order to have all 4 values after
> just one lookup's latency, you simply need to store two S-boxes in one
> BRAM block and two more in another, with no data duplication.  Is this
> what you're actually doing?  If so, this makes sense to me, but your
> wording above (and in a previous message) is inconsistent with it.
>

My wording was wrong - I was calling the whole S[4][0x100] array one S-box.
In correct wording: I am storing all 4 S-boxes in one BRAM and then the
same 4 S-boxes again in another BRAM, for a total of 8 S-boxes stored. I'll
change this to 2 S-boxes in one BRAM and the other two in another one.
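For reference, one Blowfish round needs one lookup into each of the four
S-boxes, which is why pairing two S-boxes per dual-ported BRAM block gives
all four values without duplication. A minimal software model of the F
function (the S-box contents here are a toy fill, not the real pi-derived
constants):

```c
#include <stdint.h>

/* Blowfish's four S-boxes: S[4][0x100], 32-bit entries.  Toy fill here;
   the real boxes are initialized from the digits of pi. */
static uint32_t S[4][256];

/* The Feistel F function: one lookup into each of the four S-boxes.
   S[0]/S[1] and S[2]/S[3] are independent pairs, so storing two S-boxes
   per dual-ported BRAM block serves all four lookups per round. */
static uint32_t F(uint32_t x)
{
    uint32_t a = (x >> 24) & 0xff;
    uint32_t b = (x >> 16) & 0xff;
    uint32_t c = (x >> 8)  & 0xff;
    uint32_t d =  x        & 0xff;
    return ((S[0][a] + S[1][b]) ^ S[2][c]) + S[3][d];
}
```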


> In fact, as I explained in another message earlier today, you probably
> don't even have to waste BRAMs with this approach, if you interleave two
> instances of bcrypt per core.  That said, it is in fact fine to continue
> to waste BRAMs if LUTs remain the scarce resource.
>
> > > Also, weren't you already at 3 cycles per Blowfish round before this
> > > change?  If not, then how many cycles per Blowfish round did you have
> > > in the revision that achieved ~300 c/s using 14 cores at 100 MHz?
> >
> > No, it was 5 cycles -
> > Cycle 0: initiate 2 S-box lookups
> > Cycle 1: wait
> > Cycle 2: initiate other 2 S-box lookups, compute tmp
> > Cycle 3: wait
> > Cycle 4: compute new L, swap L and R
>
> Did anything prevent you from initiating the second two lookups shown
> at Cycle 2 above, on Cycle 1 instead?


I focused on getting single-cycle-delay RAM to work, so I didn't even try
to optimize the code further.


> So you'd have:
>
> Cycle 0: initiate 2 S-box lookups
> Cycle 1: initiate other 2 S-box lookups, compute tmp
> Cycle 2: read first two lookup results from BRAM's outputs
> Cycle 3: compute new L, swap L and R
>
> or:
>
> Cycle 0: initiate 2 S-box lookups
> Cycle 1: initiate other 2 S-box lookups, compute tmp
> Cycle 2: start computation of new L using first two lookup results
> Cycle 3: compute new L, swap L and R
>

This should work fine.
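For clarity, the 4-cycle schedule above maps onto the round computation
roughly as in this software sketch. In hardware the "cycles" are pipeline
stages; here they are just sequential statements, the S-box fill is
illustrative, and the placement of the P-array XOR follows the textbook
round (L ^= P[i] before F), which is an assumption:

```c
#include <stdint.h>

static uint32_t S2box[4][256];  /* toy S-box fill */

/* One Blowfish round split along the 4-cycle schedule quoted above. */
static void round4(uint32_t *L, uint32_t *R, uint32_t P)
{
    uint32_t x = *L ^ P;
    /* Cycle 0: initiate 2 S-box lookups (S[0], S[1]) */
    uint32_t s0 = S2box[0][(x >> 24) & 0xff];
    uint32_t s1 = S2box[1][(x >> 16) & 0xff];
    /* Cycle 1: initiate the other 2 lookups (S[2], S[3]); compute tmp */
    uint32_t s2 = S2box[2][(x >> 8) & 0xff];
    uint32_t s3 = S2box[3][x & 0xff];
    uint32_t tmp = s0 + s1;
    /* Cycle 2: start combining using the first two lookup results */
    uint32_t t = tmp ^ s2;
    /* Cycle 3: compute new L, swap L and R */
    *L = *R ^ (t + s3);
    *R = x;
}
```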


>
> Indeed, I am not suggesting to go from 3 cycles to 4.  This is just a
> mental exercise for our better understanding.
>
> > > Yet another thing I found confusing is you mentioning some high offsets
> > > into a (logical) 64 KB BRAM.  Is this some extra BRAM you're using for
> > > host to FPGA transfer only?
> >
> > Yes, there is 64K shared BRAM used for host transfers (I used 64K
> > because I thought more than 14 cores would fit).  The host puts data
> > for all cores into that BRAM using DMA.  Then the arbiter controls the
> > FSM in the cores so that only one core fetches data from shared BRAM
> > at a time.  After all cores transfer data from shared BRAM into local
> > BRAMs, computation is started.  And then the arbiter again makes sure
> > that cores store data one by one.  All this time, the host waits in a
> > polling loop checking a software-accessible register.  This register
> > is set to FF after all cores store data.  At the end, a DMA transfer
> > from BRAM to host is done.
>
> OK, this is clear.  We could improve upon this approach, but maybe we
> don't need to, if we have BRAM blocks to waste anyway.  A concern,
> though, is how many slices we're spending on the logic initializing the
> per-core BRAMs, and whether that can be optimized.  We may look into
> that a bit later.
>

With only one core, utilization is:
Register: 5%
LUT: 15%
Slice: 25%
RAMB36E1: 6%
RAMB18E1: 1%
BUFG: 3%

The two AXI buses and DMA take away some space (I think around 20% of the
Slice utilization). I'll try to think about other possible ways of
host-FPGA communication.
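The host side of the handshake described earlier in the thread could be
modeled roughly as below. Only the protocol (poll a software-accessible
register until it reads FF) comes from the message; the register mapping
and pointer setup are hypothetical:

```c
#include <stdint.h>

/* Host-side polling step of the DMA handshake.  In the real design this
   pointer would be mapped to the FPGA's status register; the mapping
   here is a placeholder. */
static volatile uint32_t *status_reg;

static void wait_for_cores(void)
{
    /* The register is set to FF once all cores have stored their
       results back to the shared BRAM. */
    while ((*status_reg & 0xffu) != 0xffu)
        ;  /* busy-wait, matching the polling loop in the message */
}
```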

Katja
