john-dev - Re: ZedBoard / Parallella: bcrypt

Message-ID: <20130907224212.GA12946@openwall.com>
Date: Sun, 8 Sep 2013 02:42:12 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: ZedBoard / Parallella: bcrypt

Katja,

On Sat, Sep 07, 2013 at 11:47:49AM +0200, Katja Malvoni wrote:
> I have an implementation of bcrypt's most costly loop, and behavioral
> simulation gives correct results. But when I put it in the user part of the
> IP core generated by the Create and Import Peripheral wizard, it produces
> an incorrect result (code is attached, user_logic.v).
> And I can't figure out why. If I write something to block RAM from PL, host
> reads correct data. If I don't modify the contents of memory, host reads
> the same unmodified data. If PL reads data from one location in block RAM
> and writes to another one, host reads expected value. My guess was that
> there is a problem with reading/writing to block RAM from PL but when there
> is no computation implemented in logic but only reads and writes it works
> as expected. Does anyone have an idea how to debug this?
> Changing portions of code to see what would happen isn't practical because
> bitstream generation takes 20 minutes.

I currently don't have an idea better than trying to bisect it - rather
than changing small portions of code, try to keep roughly half of the
computation in there (and do the same in a "reference" implementation
in C in order to have the expected correct outputs).  That way, you may
be able to identify which "half" has the issue, or whether both do.
If only one produces incorrect results, then split it in "halves" again.
That way, you wouldn't need to generate the bitstream more than a few
times until you arrive at a fairly small piece of code that still has
the issue - which you might then be able to spot far more easily.
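
A minimal sketch of what such a "halved" reference could look like in C.
It assumes the costly loop is built from Blowfish-style Feistel rounds;
the names bf_state, fill_state and bf_rounds are hypothetical, and the
state is filled with placeholder values rather than the real Blowfish
constants.  The point is only that the reference can stop after any
round, so its intermediate (L, R) can be compared against a PL build
that stops at the same place.

```c
#include <assert.h>
#include <stdint.h>

/* Blowfish-style state: 18 subkeys and four 256-entry S-boxes. */
typedef struct {
    uint32_t P[18];
    uint32_t S[4][256];
} bf_state;

/* ASSUMPTION: placeholder contents from a simple LCG, NOT the real
 * Blowfish initial constants.  For a real comparison, load exactly the
 * values that the host writes into the block RAM. */
static void fill_state(bf_state *st, uint32_t seed)
{
    uint32_t x = seed;
    for (int i = 0; i < 18; i++) {
        x = x * 1664525u + 1013904223u;
        st->P[i] = x;
    }
    for (int b = 0; b < 4; b++) {
        for (int i = 0; i < 256; i++) {
            x = x * 1664525u + 1013904223u;
            st->S[b][i] = x;
        }
    }
}

/* Blowfish F function: four S-box lookups combined with add/xor/add. */
static uint32_t bf_f(const bf_state *st, uint32_t x)
{
    return ((st->S[0][x >> 24] + st->S[1][(x >> 16) & 0xff])
            ^ st->S[2][(x >> 8) & 0xff]) + st->S[3][x & 0xff];
}

/* Run `count` Feistel rounds starting at round `first` (0..15).
 * The final swap and output whitening are deliberately left out, so
 * the result is a raw checkpoint to compare against the hardware. */
static void bf_rounds(const bf_state *st, uint32_t *l, uint32_t *r,
                      int first, int count)
{
    assert(first >= 0 && count >= 0 && first + count <= 16);
    uint32_t L = *l, R = *r;
    for (int i = first; i < first + count; i++) {
        uint32_t t;
        L ^= st->P[i];
        R ^= bf_f(st, L);
        t = L; L = R; R = t;   /* swap halves for the next round */
    }
    *l = L;
    *r = R;
}
```

For example, bf_rounds(&st, &l, &r, 0, 8) gives the expected values for
a PL variant that keeps only the first 8 rounds; if those match, the
next build would keep rounds 8 to 15 instead.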

Also, to remind you, we already had Yuri's bcrypt on FPGA working
correctly (including on the actual Spartan-6 device) - so maybe you
could have started with his code - or you may use it as a reference.
I am fine with you starting from scratch, though.

http://openwall.info/wiki/crypt-dev/files

> The current implementation is attached (bcrypt_loop.v) and it's too slow:
> 7652913 clock cycles for cost 5. This comes mainly from memory latency:
> a read from memory takes 3 cycles.

3 cycles is a lot.  IIRC, it was 1 cycle on Spartan-6 and Virtex-6 when
we experimented with bcrypt on those, and it could be 2 cycles with
buffer registers if we wanted those (presumably for a higher clock rate).
Is Zynq worse in that respect?

Anyhow, a way to deal with latencies is by interleaving of multiple
bcrypt instances per core.  You do have to allocate separate block RAMs
per instance, but several instances (forming one core) can share most of
the rest of the logic.

http://www.openwall.com/lists/crypt-dev/2011/08/21/1
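
To illustrate why interleaving hides the latency, here is a rough
cycle-count model in C.  It is a sketch under simplifying assumptions,
not a model of any actual design: each instance has its own block RAM,
every read depends on the result of that instance's previous read, and
the shared logic can issue at most one read per cycle, picking ready
instances round-robin.  simulate_core and MAX_INSTANCES are
hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_INSTANCES 16

/* Returns the cycle at which the last read result becomes available.
 * instances: bcrypt instances interleaved in one core
 * latency:   cycles from issuing a read until its data is usable
 * reads:     dependent reads each instance must perform */
static uint64_t simulate_core(int instances, int latency, uint32_t reads)
{
    assert(instances >= 1 && instances <= MAX_INSTANCES);
    assert(latency >= 1);

    uint64_t ready_at[MAX_INSTANCES] = {0}; /* when each may issue again */
    uint32_t remaining[MAX_INSTANCES];
    uint64_t left = (uint64_t)instances * reads;
    uint64_t cycle = 0, finish = 0;
    int next = 0;                           /* round-robin pointer */

    for (int i = 0; i < instances; i++)
        remaining[i] = reads;

    while (left > 0) {
        /* Issue at most one read this cycle, from the first instance
         * (in round-robin order) whose previous read has completed. */
        for (int k = 0; k < instances; k++) {
            int i = (next + k) % instances;
            if (remaining[i] > 0 && ready_at[i] <= cycle) {
                ready_at[i] = cycle + (uint64_t)latency;
                remaining[i]--;
                left--;
                if (ready_at[i] > finish)
                    finish = ready_at[i];
                next = (i + 1) % instances;
                break;
            }
        }
        cycle++;
    }
    return finish;
}
```

In this model, one instance with 3-cycle reads leaves the shared logic
idle two cycles out of three, while three interleaved instances keep
it busy every cycle and finish three times the work in nearly the same
number of cycles.  Going beyond that only helps if block RAMs, rather
than logic, are still available.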

Which approach is optimal will depend on what ends up being the scarce
resource, limiting the overall speed per chip (with as many cores as
will fit).  It can be block RAMs, or it can be LUTs, or it can be
something else.

> And I can use only one port because the other one is used by DMA.

Yes, when you mentioned having used a port to do DMA a while ago, this
felt wasteful to me - and now you confirm that it is.  Perhaps you
should reconsider that?  With DMA, you may be making data transfers
from/to host slightly faster, but you're probably almost halving the
computation speed by wasting half the block RAM ports.

Is it by any chance possible to use the same block RAM ports for both
DMA and PL access, at different times?

Alexander