Date: Mon, 14 Apr 2014 16:36:06 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: ZedBoard: bcrypt

On Mon, Apr 14, 2014 at 03:53:50PM +0400, Solar Designer wrote:
> I think it might make sense to interleave multiple instances of bcrypt
> per core until you're making full use of all BRAM ports for computation.
> 
> With 4 bcrypt instances per core, you need 20 reads per round.  With 2
> cycles/round, that's 10 reads per cycle, needing 5 BRAMs.  Maybe you can
> have:
> 
> Cycle 0:
> initiate S0, S1 lookups for instances 0, 1 (total: 4 lookups)
> initiate S2, S3 lookups for instances 2, 3 (total: 4 lookups)
> initiate P lookups for instances 0, 1 (total: 2 lookups)
> (total: 10 lookups)
> Cycle 1:
> initiate S2, S3 lookups for instances 0, 1 (total: 4 lookups)
> initiate S0, S1 lookups for instances 2, 3 (total: 4 lookups)
> initiate P lookups for instances 2, 3 (total: 2 lookups)
> (total: 10 lookups)
> 
> with the computation also spread across the two cycles as appropriate
> (and maybe you can reuse the same 32-bit adders across bcrypt instances,
> although the cost of extra MUXes is likely to kill the advantage).
> 
> Expanding this to 3 cycles/round and 6 instances/core also makes sense,
> to allow for higher clock rate: not requiring the data to be available
> on the next clock cycle, but only 1 cycle later.  I recall reading that
> Xilinx BRAMs support output registers for that.
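
The two-cycle schedule quoted above can be sanity-checked with a few
lines of Python.  This is only a sketch: the table below is transcribed
from the quoted text, while the BRAM layout used in port_load() (one
BRAM per instance for its four S-boxes, plus one shared BRAM for P) is
an assumption made for illustration, and all function names are made up.

```python
from collections import Counter

PORTS_PER_BRAM = 2  # dual-port block RAM: two lookups can start per cycle

# cycle -> list of (table, instance) lookups initiated in that cycle,
# transcribed from the quoted schedule (4 instances, 2 cycles/round).
SCHEDULE = {
    0: [(t, i) for i in (0, 1) for t in ("S0", "S1")]
       + [(t, i) for i in (2, 3) for t in ("S2", "S3")]
       + [("P", i) for i in (0, 1)],
    1: [(t, i) for i in (0, 1) for t in ("S2", "S3")]
       + [(t, i) for i in (2, 3) for t in ("S0", "S1")]
       + [("P", i) for i in (2, 3)],
}


def lookups_per_cycle(schedule):
    """Number of lookups initiated in each cycle."""
    return {cycle: len(ops) for cycle, ops in schedule.items()}


def brams_needed(schedule):
    """Minimum number of dual-port BRAMs implied by the busiest cycle."""
    peak = max(lookups_per_cycle(schedule).values())
    return -(-peak // PORTS_PER_BRAM)  # ceiling division


def port_load(schedule):
    """Lookups per (cycle, BRAM) under the ASSUMED layout: BRAM i holds
    the S-boxes of instance i, and one extra BRAM "P" holds P for all."""
    load = Counter()
    for cycle, ops in schedule.items():
        for table, inst in ops:
            bram = "P" if table == "P" else inst
            load[(cycle, bram)] += 1
    return load


def round_is_complete(schedule, instances=range(4)):
    """True if every instance does S0..S3 and P exactly once per round."""
    seen = Counter(op for ops in schedule.values() for op in ops)
    wanted = {(t, i) for i in instances
              for t in ("S0", "S1", "S2", "S3", "P")}
    return set(seen) == wanted and all(n == 1 for n in seen.values())


if __name__ == "__main__":
    print("lookups per cycle:", lookups_per_cycle(SCHEDULE))
    print("BRAMs needed:", brams_needed(SCHEDULE))
    print("max lookups on one BRAM in one cycle:",
          max(port_load(SCHEDULE).values()))
    print("round complete:", round_is_complete(SCHEDULE))
```

Under the assumed layout no BRAM is asked for more than two lookups in
any cycle, which is what a dual-port BRAM can serve.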

It appears that with this extra cycle of latency, you can have either
3 instances/core with 4 BRAMs per core, leaving one port free for
initialization use (on the BRAM holding P and misc. data only), or
6 instances/core with 7 BRAMs per core and no free ports.  Either way,
it is unclear whether the extra cycle of latency will allow the clock
rate to increase by more than 50%, which is what it takes to compensate
for, and actually benefit from, going from 2 to 3 cycles/round.
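
The figures above can be reproduced with a small calculation.  This is
a rough sketch, not a model of the actual design: it assumes one BRAM
per instance for the S-boxes plus one shared BRAM for P and misc. data,
one P lookup per instance per round spread evenly over the cycles, and
dual-port BRAMs.  The function names are hypothetical.

```python
from fractions import Fraction

PORTS_PER_BRAM = 2  # dual-port block RAM


def brams_per_core(instances):
    """ASSUMED layout: one S-box BRAM per instance, plus one shared BRAM
    holding P and misc. data."""
    return instances + 1


def free_p_ports(instances, cycles_per_round):
    """Ports of the shared P BRAM left idle, assuming each instance does
    one P lookup per round and the lookups are spread evenly."""
    p_reads_per_cycle = Fraction(instances, cycles_per_round)
    return PORTS_PER_BRAM - p_reads_per_cycle


def breakeven_clock_ratio(old_cycles_per_round, new_cycles_per_round):
    """Clock rate ratio at which per-instance speed is unchanged when the
    number of cycles per round changes."""
    return Fraction(new_cycles_per_round, old_cycles_per_round)


if __name__ == "__main__":
    for instances in (3, 6):
        print(instances, "instances/core at 3 cycles/round:",
              brams_per_core(instances), "BRAMs,",
              free_p_ports(instances, 3), "free port(s) on the P BRAM")
    ratio = breakeven_clock_ratio(2, 3)
    print("break-even clock ratio, 2 -> 3 cycles/round:", ratio,
          "=", float(ratio))
```

The break-even ratio here is per instance; normalizing by BRAM count
instead would give a different threshold.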

> It'd be fine to proceed with these additional optimizations after moving
> to ztex.  (Perhaps the optimizations can then be backported to the Zynq
> on ZedBoard platform, just to have "final" speed figures for it.)

Alexander
