Date: Tue, 22 Apr 2014 02:41:47 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: ZedBoard: bcrypt

Katja,

On Mon, Apr 14, 2014 at 04:38:57PM +0400, Solar Designer wrote:
> On Mon, Apr 14, 2014 at 03:53:50PM +0400, Solar Designer wrote:
> > With 4 bcrypt instances per core, you need 20 reads per round.  With 2
> > cycles/round, that's 10 reads per cycle, needing 5 BRAMs.  Maybe you
> > can have:
> >
> > Cycle 0:
> > initiate S0, S1 lookups for instances 0, 1 (total: 4 lookups)
> > initiate S2, S3 lookups for instances 2, 3 (total: 4 lookups)
> > initiate P lookups for instances 0, 1 (total: 2 lookups)
> > (total: 10 lookups)
> > Cycle 1:
> > initiate S2, S3 lookups for instances 0, 1 (total: 4 lookups)
> > initiate S0, S1 lookups for instances 2, 3 (total: 4 lookups)
> > initiate P lookups for instances 2, 3 (total: 2 lookups)
> > (total: 10 lookups)
> >
> > with the computation also spread across the two cycles as appropriate
> > (and maybe you can reuse the same 32-bit adders across bcrypt
> > instances, although the cost of extra MUXes is likely to kill the
> > advantage).
>
> BTW, in the example above the data fits into different BRAMs in an
> intuitive manner: each one of the four bcrypt instances can have its
> S-boxes in its own BRAM block, with nothing else in it.  The four
> P-boxes all fit in a fifth BRAM block.  So this may be the winning
> approach in terms of simplicity and elegance.

It occurred to me (earlier today) that in the above approach we're
unnecessarily wasting half of the parallelism inherent to bcrypt: if we
can do all 5 lookups (4 S-box lookups and 1 P-box lookup) simultaneously,
then probably we should.  By artificially splitting this across two
cycles, we're wasting half the BRAM space to regain the equivalent
amount of parallelism by bringing in more instances.  We're probably
also increasing communication delays as a side effect of having more
instances, as you noted.
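The quoted two-cycle schedule can be sanity-checked in software.  Here's a
tiny Python model (the BRAM numbering and function names are mine, purely
illustrative, not from any actual HDL) that counts reads per BRAM per
cycle and confirms that each of the five dual-ported BRAMs is asked for at
most 2 reads, with 10 lookups per cycle in total:

```python
# BRAMs 0..3 hold the S-boxes of instances 0..3 respectively;
# BRAM 4 holds all four P-boxes (the fifth block mentioned above).
S_BRAM = {inst: inst for inst in range(4)}
P_BRAM = 4

def reads_in_cycle(cycle):
    """Return {bram: number_of_reads} for one cycle of the schedule."""
    reads = {}
    if cycle % 2 == 0:
        s01, s23, p = (0, 1), (2, 3), (0, 1)   # cycle 0 pattern
    else:
        s01, s23, p = (2, 3), (0, 1), (2, 3)   # cycle 1 pattern
    for inst in s01:                        # S0, S1 lookups: 2 reads each
        reads[S_BRAM[inst]] = reads.get(S_BRAM[inst], 0) + 2
    for inst in s23:                        # S2, S3 lookups: 2 reads each
        reads[S_BRAM[inst]] = reads.get(S_BRAM[inst], 0) + 2
    for inst in p:                          # P lookups: 1 read each
        reads[P_BRAM] = reads.get(P_BRAM, 0) + 1
    return reads

for c in (0, 1):
    r = reads_in_cycle(c)
    assert sum(r.values()) == 10            # 10 lookups per cycle total
    assert all(n <= 2 for n in r.values())  # fits dual-ported BRAMs
```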
What we can probably have instead is a single-cycle implementation of a
bcrypt instance, so we'll fully use the BRAM ports with only 2
instances/core:

On each cycle:
Compute 2x 4 S-box lookup indices from the previous cycle's lookup results
Initiate 2x 4 S-box lookups from 4 BRAMs
Initiate 2x P-box lookups from 1 BRAM

Now, the computation might be slightly slower: it's add, xor, add done
sequentially on the same cycle, right before the BRAM lookups.  However,
I think it's still fast enough to keep the clock rate sane, and on the
bright side we're keeping this circuit busy on every clock cycle, so we
need fewer circuits like this for the same total performance.  Overall,
I expect a reduction in LUT utilization compared to the 4 instances/core
approach, which might be handy in terms of power consumption reduction
on Zynq and/or in terms of getting things to fit on Spartan-6 LX150 (not
sure whether BRAMs or LUTs will turn out to be the scarce resource for
us there).

In terms of expected performance, excluding communication and host
overhead and assuming the same clock rate, this should be exactly the
same as your current 28 cores, 112 instances.  We'll have 28 cores too,
but they will contain only 56 instances (and will use less space per
BRAM for now).  The reduction in instance count is fully compensated for
by the reduction from 2 cycles/round to 1 cycle/round.  What we gain is
needing to communicate with fewer instances, maybe reducing the
overhead.  We also gain the likely reduction in LUT utilization
mentioned above, and most importantly:

A possible next step is to introduce 4 instances/core on top of this,
gaining an extra cycle of latency that we can use to increase the clock
rate.  One way to use this extra cycle is to move the entire add/xor/add
thing onto this cycle (interleave the four instances such that when two
do lookups, the other two do computation, and vice versa).  Another way
to use it is to enable the BRAMs' output registers.
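For reference, here's a minimal Python model of the per-round work a
single-cycle instance performs: one P-box read, four S-box reads, and the
sequential add, xor, add.  The S-box and P-box contents below are dummy
fill, not the real pi-derived bcrypt initialization, so this only
illustrates the dataflow and the throughput arithmetic, not the cipher
itself:

```python
M = 0xffffffff  # 32-bit mask

# Dummy table contents, for dataflow illustration only.
S = [[(i * 0x9e3779b9 + box) & M for i in range(256)] for box in range(4)]
P = [(n * 0x61c88647 + 1) & M for n in range(18)]

def round_step(xl, xr, n):
    xl ^= P[n]               # 1 P-box lookup
    a = (xl >> 24) & 0xff    # four S-box indices, one byte each
    b = (xl >> 16) & 0xff
    c = (xl >> 8) & 0xff
    d = xl & 0xff
    # add, xor, add: the sequential computation done within one cycle,
    # feeding the next cycle's lookups
    f = ((((S[0][a] + S[1][b]) & M) ^ S[2][c]) + S[3][d]) & M
    return xr ^ f, xl        # swap halves

xl, xr = 0x01234567, 0x89abcdef
for n in range(16):
    xl, xr = round_step(xl, xr, n)

# Throughput equivalence noted above: 112 two-cycle instances and
# 56 single-cycle instances both complete 56 rounds per clock.
assert 112 / 2 == 56 / 1 == 56
```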
Either of these may allow for a higher clock rate (and it's unclear
which one is faster).  So that's another 3 code versions to try (56
1-cycle instances, and 112 2-cycle instances done in two different
ways).

On the bright side, I think there's no longer a reason to try the 6/7
idea I had posted last week, which would require a clock rate increase
by a factor of more than 1.5*112/120 = 1.4 to achieve any gain.  This
new idea lets us have the extra cycle and the associated clock rate
increase without any penalty in terms of throughput per cycle.

Please feel free to experiment with this on the Ztex board and/or
ZedBoard when convenient.

Thanks,

Alexander