Date: Sun, 8 Sep 2013 15:39:14 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: ZedBoard / Parallella: bcrypt

Katja,

On Sun, Sep 08, 2013 at 10:15:57AM +0200, Katja Malvoni wrote:
> Xilinx tools offer two types of block RAM: one is native and the
> other has an AXI4 interface. I need bram with AXI4 because I use
> that bram to communicate with PS. I think that native bram has 1-cycle
> latency, but I'm not sure.
> Another option is having inferred native dual port bram and copying
> everything from the bram used to communicate with PS to internal bram.
> This will save time but it will have an impact on area.

My guess is that the latter will work better, especially at higher
bcrypt cost settings (not $2a$05, which we use as a reference for
benchmarking, but the currently more realistic $2a$08 or higher).

I think we may need to optimally (de-)synchronize the bcrypt cores, so
that they comfortably get their initial S- and P-box values transferred
to them sequentially (such as from a few other block RAMs that we'd
allocate to this purpose globally, shared between the cores).
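
As a rough illustration of why staggering the cores should be cheap, here
is a minimal Python sketch of the cycle budget. It is not code from this
thread. The function names, the one-word-per-cycle transfer rate, and the
one-cycle-per-Blowfish-round figure are all assumptions made for the
example; only the sizes (18 P-array words, 4 x 256 S-box words) and the
bcrypt iteration structure are fixed by the algorithm.

```python
# Rough cycle-budget sketch; all rates below are assumptions, not measurements.

P_WORDS = 18           # Blowfish P-array, 32-bit words
S_WORDS = 4 * 256      # four S-boxes of 256 words each
STATE_WORDS = P_WORDS + S_WORDS


def load_cycles(words_per_cycle=1):
    """Cycles to stream one core's initial P/S values from a shared block RAM."""
    return -(-STATE_WORDS // words_per_cycle)  # ceiling division


def compute_cycles(cost, cycles_per_round=1):
    """Approximate cycles for one bcrypt hash, ignoring control overhead."""
    expand_keys = 1 + 2 * (1 << cost)   # initial setup plus two per iteration
    # Each key expansion re-encrypts the whole state, two words per block;
    # the final stage encrypts three 64-bit blocks 64 times.
    encryptions = expand_keys * (STATE_WORDS // 2) + 64 * 3
    return encryptions * 16 * cycles_per_round  # 16 Blowfish rounds per block


def start_offsets(num_cores, words_per_cycle=1):
    """Stagger core start times so the shared loader serves one core at a time."""
    step = load_cycles(words_per_cycle)
    return [i * step for i in range(num_cores)]


def max_cores_per_loader(cost, words_per_cycle=1, cycles_per_round=1):
    """How many cores one loader can feed before a core would wait for it."""
    load = load_cycles(words_per_cycle)
    period = load + compute_cycles(cost, cycles_per_round)
    return period // load


if __name__ == "__main__":
    print(start_offsets(4))
    for cost in (5, 8, 12):
        print(cost, compute_cycles(cost), max_cores_per_loader(cost))
```

Under these assumptions the initial transfer is a small fraction of the
per-hash time even at $2a$05, and the fraction shrinks as the cost
setting grows, so a few shared block RAMs feeding the cores one after
another should not become the bottleneck.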

> > Yes, when you mentioned having used a port to do DMA a while ago, this
> > felt wasteful to me - and now you confirm that it is.  Perhaps you
> > should reconsider that?  With DMA, you may be making data transfers
> > from/to host slightly faster, but you're probably almost halving the
> > computation speed by wasting half the block RAM ports.
> 
> The problem is that I don't know how else to communicate with PL. There is
> no reference from Xilinx and connecting my logic to the other RAM port was
> the only way I could access data sent from host. Another way would be to
> make the second bram port external and then connect it to user logic, but
> it would still use only one port. It took me a lot of time to figure out
> how to do that communication and I really don't have any other idea except
> this one.

OK, you may continue with the approach you've already learned, but
perhaps apply it to a (to be) shared block RAM, not to the block RAM
used during bcrypt computation: you'll need both ports of the latter
for computation, and you need the 1-cycle latency too.
IIRC, Yuri's code was doing two S-box loads per cycle (via the two
ports), with the data becoming available for computation the next cycle.
With proper interleaving (perhaps two instances of bcrypt per core),
you'd keep all logic busy (doing useful work) on every cycle.
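
The effect of interleaving can be shown with a toy cycle model in
Python. This is a sketch under a simplifying assumption, not a
description of Yuri's code or of any actual design: each bcrypt instance
spends one cycle in the round logic (which also issues its next S-box
loads) and then waits read_latency cycles for the block RAM data. The
function name and parameters are hypothetical.

```python
def logic_utilization(num_instances, total_cycles=1000, read_latency=1):
    """Fraction of cycles in which the shared round logic does useful work.

    Assumed model: an instance that uses the logic in cycle c has issued its
    next S-box loads by the end of that cycle and can use the logic again
    only in cycle c + 1 + read_latency.
    """
    ready_at = [0] * num_instances  # first cycle each instance may run again
    busy = 0
    for cycle in range(total_cycles):
        for i in range(num_instances):
            if ready_at[i] <= cycle:
                # Instance i has its data: run one round, issue the next loads.
                ready_at[i] = cycle + 1 + read_latency
                busy += 1
                break  # the logic serves at most one instance per cycle
    return busy / total_cycles


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, logic_utilization(n))
```

In this model a single instance leaves the logic idle every other cycle
at 1-cycle latency, while two interleaved instances fill every cycle. A
longer read latency would need correspondingly more instances per core,
which is one more reason to want the 1-cycle native block RAM.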

Alexander
