Date: Wed, 30 Oct 2013 21:42:41 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: ZedBoard: bcrypt

On Wed, Oct 30, 2013 at 05:31:48PM +0100, Katja Malvoni wrote:
> If I use two ports of the BRAM then I can get only two words out and I
> need 4.

Sure.

> So I store each S-box in two BRAM blocks in order to have all 4 values
> after 2 cycles of delay.

It is unclear what you mean by "each S-box" above.  Blowfish uses four
S-boxes.  Do you mean that you're storing each of the four twice, for a
total of 8 S-boxes stored (two dupes of each)?  If so, I see no reason
for you to be doing that.  In order to have all 4 values after just one
lookup's latency, you simply need to store two S-boxes in one BRAM block
and two more in another, with no data duplication.  Is this what you're
actually doing?  If so, this makes sense to me, but your wording above
(and in a previous message) is inconsistent with it.

> Since LUT utilization is a problem, wasting some
> BRAM to get faster seemed okay.

Sure.  In fact, as I explained in another message earlier today, you
probably don't even have to waste BRAMs with this approach, if you
interleave two instances of bcrypt per core.  That said, it is in fact
fine to continue to waste BRAMs if LUTs remain the scarce resource.

> > Also, weren't you already at 3 cycles per Blowfish round before this
> > change?  If not, then how many cycles per Blowfish round did you have
> > in the revision that achieved ~300 c/s using 14 cores at 100 MHz?
>
> No, it was 5 cycles -
> Cycle 0: initiate 2 S-box lookups
> Cycle 1: wait
> Cycle 2: initiate other 2 S-box lookups, compute tmp
> Cycle 3: wait
> Cycle 4: compute new L, swap L and R

Did anything prevent you from initiating the second two lookups shown at
Cycle 2 above, on Cycle 1 instead?
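For reference, here is a minimal C model of the round being scheduled in
the cycle breakdowns above.  The split of F() into a "tmp" half (first
two S-box lookups) and a final half (second two) is meant to mirror that
schedule, though exactly which sub-expression the hardware computes as
"tmp" is my assumption:

```c
#include <stdint.h>

/* Software model of one Blowfish round.  S-boxes and P-array are
 * assumed to be initialized elsewhere (by the bcrypt key schedule). */
static uint32_t S[4][256];
static uint32_t P[18];

static uint32_t blowfish_f(uint32_t x)
{
    uint32_t a = x >> 24, b = (x >> 16) & 0xff;
    uint32_t c = (x >> 8) & 0xff, d = x & 0xff;
    uint32_t tmp = S[0][a] + S[1][b];    /* needs only the first two lookups */
    return (tmp ^ S[2][c]) + S[3][d];    /* needs the second two lookups */
}

static void blowfish_round(uint32_t *L, uint32_t *R, int i)
{
    uint32_t t;
    *L ^= P[i];
    *R ^= blowfish_f(*L);    /* compute new value from F() output */
    t = *L; *L = *R; *R = t; /* swap L and R */
}
```

The point of the pipelining discussion is that the four lookups feeding
F() are independent of each other, so nothing data-wise forces a wait
cycle between initiating the first pair and the second pair.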
So you'd have:

Cycle 0: initiate 2 S-box lookups
Cycle 1: initiate other 2 S-box lookups, compute tmp
Cycle 2: read first two lookup results from BRAM's outputs
Cycle 3: compute new L, swap L and R

or:

Cycle 0: initiate 2 S-box lookups
Cycle 1: initiate other 2 S-box lookups, compute tmp
Cycle 2: start computation of new L using first two lookup results
Cycle 3: compute new L, swap L and R

Indeed, I am not suggesting to go from 3 cycles to 4.  This is just a
mental exercise for our better understanding.

> > Yet another thing I found confusing is you mentioning some high
> > offsets into a (logical) 64 KB BRAM.  Is this some extra BRAM you're
> > using for host to FPGA transfer only?
>
> Yes, there is 64K shared BRAM used for host transfers (I used 64K
> because I thought more than 14 cores would fit).  The host puts data
> for all cores into that BRAM using DMA.  Then the arbiter controls the
> FSM in the cores so that only one core fetches data from shared BRAM
> at a time.  After all cores transfer data from shared BRAM into local
> BRAMs, computation is started.  And then the arbiter again makes sure
> that the cores store data one by one.  All this time, the host waits
> in a polling loop checking a software accessible register.  This
> register is set to FF after all cores store data.  At the end, a DMA
> transfer from BRAM to host is done.

OK, this is clear.  We could improve upon this approach, but maybe we
don't need to, if we have BRAM blocks to waste anyway.  A concern,
though, is how many slices we're spending on the logic initializing the
per-core BRAMs, and whether that can be optimized.  We may look into
that a bit later.

Alexander
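P.S. For concreteness, the host-side handshake described above can be
sketched in C roughly as follows.  All names here (run_batch,
status_reg, fpga_stub_compute) are illustrative, and the stub stands in
for the FPGA side; the real host uses DMA transfers and a memory-mapped
software-accessible register, not memcpy and a plain variable:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static uint8_t shared_bram[64 * 1024];  /* stand-in for the 64K shared BRAM */
static volatile uint8_t status_reg;     /* set to FF once all cores store data */

static void fpga_stub_compute(void)
{
    /* Stand-in for the FPGA side: the arbiter lets each core fetch its
     * data from shared BRAM in turn, the cores compute, and the arbiter
     * then lets each core store its results back one by one. */
    status_reg = 0xFF;
}

static void run_batch(const void *in, void *out, size_t n)
{
    memcpy(shared_bram, in, n);    /* host -> shared BRAM (DMA in reality) */
    status_reg = 0;
    fpga_stub_compute();
    while (status_reg != 0xFF)     /* host polls the status register */
        ;
    memcpy(out, shared_bram, n);   /* shared BRAM -> host (DMA in reality) */
}
```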