john-dev - Re: ZedBoard: bcrypt



Message-ID: <20140414123606.GA27566@openwall.com>
Date: Mon, 14 Apr 2014 16:36:06 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: ZedBoard: bcrypt

On Mon, Apr 14, 2014 at 03:53:50PM +0400, Solar Designer wrote:
> I think it might make sense to interleave multiple instances of bcrypt
> per core until you're making full use of all BRAM ports for computation.
> 
> With 4 bcrypt instances per core, you need 20 reads per round.  With 2
> cycles/round, that's 10 reads per cycle, needing 5 BRAMs.  Maybe you can
> have:
> 
> Cycle 0:
> initiate S0, S1 lookups for instances 0, 1 (total: 4 lookups)
> initiate S2, S3 lookups for instances 2, 3 (total: 4 lookups)
> initiate P lookups for instances 0, 1 (total: 2 lookups)
> (total: 10 lookups)
> Cycle 1:
> initiate S2, S3 lookups for instances 0, 1 (total: 4 lookups)
> initiate S0, S1 lookups for instances 2, 3 (total: 4 lookups)
> initiate P lookups for instances 2, 3 (total: 2 lookups)
> (total: 10 lookups)
> 
> with the computation also spread across the two cycles as appropriate
> (and maybe you can reuse the same 32-bit adders across bcrypt instances,
> although the cost of extra MUXes is likely to kill the advantage).
> 
> Expanding this to 3 cycles/round and 6 instances/core also makes sense,
> to allow for higher clock rate: not requiring the data to be available
> on the next clock cycle, but only 1 cycle later.  I recall reading that
> Xilinx BRAMs support output registers for that.
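
The port arithmetic quoted above can be checked with a short sketch.
It assumes 5 reads per bcrypt instance per round (4 S-box lookups plus
1 P lookup, which is what "20 reads per round" for 4 instances implies)
and dual-port BRAMs serving one read per port per cycle.  The names
below are illustrative only, and the result is a lower bound from read
ports alone; it says nothing about storage capacity.

```python
import math

READS_PER_ROUND = 5   # assumed: 4 S-box lookups + 1 P lookup per instance
PORTS_PER_BRAM = 2    # assumed: dual-port BRAM, one read per port per cycle

def brams_needed(instances, cycles_per_round):
    """Return (reads per cycle, minimum BRAM count) from port usage alone."""
    reads_per_round = instances * READS_PER_ROUND
    reads_per_cycle = math.ceil(reads_per_round / cycles_per_round)
    return reads_per_cycle, math.ceil(reads_per_cycle / PORTS_PER_BRAM)

# Lookups initiated in each cycle of the quoted schedule: (S-box, P)
SCHEDULE = {0: (4 + 4, 2), 1: (4 + 4, 2)}

reads_per_cycle, brams = brams_needed(instances=4, cycles_per_round=2)
for cycle, (s_lookups, p_lookups) in SCHEDULE.items():
    # Each cycle of the schedule should use exactly the per-cycle read budget
    assert s_lookups + p_lookups == reads_per_cycle
print(reads_per_cycle, brams)
```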

It appears that with this extra cycle of latency, you can have either
3 instances/core with 4 BRAMs per core (so 4 BRAMs per 3 instances),
leaving one port free for initialization use (on the BRAM holding P and
misc. data only), or 6 instances/core with 7 BRAMs per core (so
7 BRAMs per 6 instances), leaving no free ports.  Either way, it is
unclear whether the extra cycle of latency will allow for a clock rate
increase of more than 50%.  A 50% increase is needed just to compensate
for going from 2 to 3 cycles/round; only an increase beyond that would
be an actual benefit.
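
To make the trade-off concrete, here is a rough sketch that tabulates
BRAMs per instance and the break-even clock ratio for the three layouts
mentioned in this thread.  The instance, cycle, and BRAM counts are
taken from the text above; the function and variable names are mine,
and the break-even figure considers cycles per round only, not any
other effect on achievable clock rate.

```python
from fractions import Fraction

# (instances per core, cycles per round, BRAMs per core), as given in the thread
CONFIGS = {
    "2 cycles/round, 4 instances": (4, 2, 5),
    "3 cycles/round, 3 instances": (3, 3, 4),
    "3 cycles/round, 6 instances": (6, 3, 7),
}

def brams_per_instance(instances, brams):
    """BRAM cost of one bcrypt instance, as an exact fraction."""
    return Fraction(brams, instances)

def break_even_clock_ratio(old_cycles, new_cycles):
    """Clock ratio at which per-instance speed matches the old layout."""
    return Fraction(new_cycles, old_cycles)

for name, (instances, cycles, brams) in CONFIGS.items():
    # Compare each layout against the 2 cycles/round baseline
    print(name,
          "BRAMs/instance:", brams_per_instance(instances, brams),
          "break-even clock ratio:", break_even_clock_ratio(2, cycles))
```

A ratio of 3/2 for the 3 cycles/round layouts is the 50% figure above:
the clock has to run at least 1.5 times faster before the extra cycle
pays off.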

> It'd be fine to proceed with these additional optimizations after moving
> to ztex.  (Perhaps the optimizations can then be backported to the Zynq
> on ZedBoard platform, just to have "final" speed figures for it.)

Alexander
