Date: Wed, 30 Oct 2013 18:55:10 +0100
From: Katja Malvoni <kmalvoni@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: ZedBoard: bcrypt

On Wed, Oct 30, 2013 at 6:42 PM, Solar Designer <solar@...nwall.com> wrote:
> > So I store each S-box in two BRAM blocks in order to have all 4 values
> > after 2 cycles of delay.
>
> It is unclear what you mean by "each S-box" above.  Blowfish uses four
> S-boxes.  Do you want to say that you're storing each of the four twice,
> for a total of 8 S-boxes stored (two dupes of each)?  If so, I see no
> reason for you to be doing that.  In order to have all 4 values after
> just one lookup's latency, you simply need to store two S-boxes in one
> BRAM block and two more in another, with no data duplication.  Is this
> what you're actually doing?  If so, this makes sense to me, but your
> wording above (and in a previous message) is inconsistent with it.

I was using the wrong wording - I was calling S[0x100] one S-box. So in
correct wording: I am storing 4 S-boxes in one BRAM and then the same 4
S-boxes again in another BRAM, which is a total of 8 S-boxes. I'll change
this to 2 S-boxes in one BRAM and 2 in the other.

> In fact, as I explained in another message earlier today, you probably
> don't even have to waste BRAMs with this approach, if you interleave two
> instances of bcrypt per core.  That said, it is in fact fine to continue
> to waste BRAMs if LUTs remain the scarce resource.
>
> > > Also, weren't you already at 3 cycles per Blowfish round before this
> > > change?  If not, then how many cycles per Blowfish round did you have
> > > in the revision that achieved ~300 c/s using 14 cores at 100 MHz?
> >
> > No, it was 5 cycles -
> > Cycle 0: initiate 2 S-box lookups
> > Cycle 1: wait
> > Cycle 2: initiate other 2 S-box lookups, compute tmp
> > Cycle 3: wait
> > Cycle 4: compute new L, swap L and R
>
> Did anything prevent you from initiating the second two lookups shown
> at Cycle 2 above, on Cycle 1 instead?

I focused on getting the single-cycle-delay RAM to work, so I didn't even
try to optimize the code further.

> So you'd have:
>
> Cycle 0: initiate 2 S-box lookups
> Cycle 1: initiate other 2 S-box lookups, compute tmp
> Cycle 2: read first two lookup results from BRAM's outputs
> Cycle 3: compute new L, swap L and R
>
> or:
>
> Cycle 0: initiate 2 S-box lookups
> Cycle 1: initiate other 2 S-box lookups, compute tmp
> Cycle 2: start computation of new L using first two lookup results
> Cycle 3: compute new L, swap L and R

This should work fine.
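To check my understanding of the split, this is roughly how I'd model one
round in C (just a software model of the intended data flow, not the HDL;
the names are mine - sbox01 stands for the BRAM holding S0 and S1, sbox23
for the one holding S2 and S3):

#include <stdint.h>

/*
 * Model of the planned layout: two S-boxes per BRAM, no duplication.
 * In hardware each BRAM is true dual port, so its two lookups are
 * issued on the two ports in the same cycle and both results are back
 * after a single lookup's latency.
 */
static uint32_t bf_f(const uint32_t sbox01[2][0x100],
                     const uint32_t sbox23[2][0x100], uint32_t x)
{
	uint32_t a = sbox01[0][x >> 24];          /* BRAM 0, port A */
	uint32_t b = sbox01[1][(x >> 16) & 0xff]; /* BRAM 0, port B */
	uint32_t c = sbox23[0][(x >> 8) & 0xff];  /* BRAM 1, port A */
	uint32_t d = sbox23[1][x & 0xff];         /* BRAM 1, port B */

	return ((a + b) ^ c) + d;
}

/* One Blowfish round in its standard form; each of the two BRAMs
 * serves exactly two of the four lookups per round. */
static void bf_round(const uint32_t sbox01[2][0x100],
                     const uint32_t sbox23[2][0x100],
                     uint32_t *L, uint32_t *R, uint32_t P)
{
	uint32_t xl = *L ^ P;
	uint32_t newL = *R ^ bf_f(sbox01, sbox23, xl);

	*R = xl;
	*L = newL;
}

In the core itself the adds and XORs would of course be spread across the
cycles as in whichever of your two schedules turns out easier to meet
timing with.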
> Indeed, I am not suggesting to go from 3 cycles to 4.  This is just a
> mental exercise for our better understanding.
>
> > > Yet another thing I found confusing is you mentioning some high offsets
> > > into a (logical) 64 KB BRAM.  Is this some extra BRAM you're using for
> > > host to FPGA transfer only?
> >
> > Yes, there is 64K shared BRAM used for host transfers (I used 64K
> > because I thought more than 14 cores would fit).  The host puts data for
> > all cores into that BRAM using DMA.  Then the arbiter controls the FSM
> > in the cores so that only one core fetches data from the shared BRAM at
> > a time.  After all cores transfer data from the shared BRAM into local
> > BRAMs, computation is started.  And then the arbiter again makes sure
> > that the cores store data one by one.  All this time, the host waits in
> > a polling loop checking a software-accessible register.  This register
> > is set to FF after all cores store data.  At the end, a DMA transfer
> > from BRAM to host is done.
>
> OK, this is clear.  We could improve upon this approach, but maybe we
> don't need to, if we have BRAM blocks to waste anyway.  A concern,
> though, is how many slices we're spending on the logic initializing the
> per-core BRAMs, and whether that can be optimized.  We may look into
> that a bit later.

With only one core, utilization is:

Register: 5%
LUT:      15%
Slice:    25%
RAMB36E1: 6%
RAMB18E1: 1%
BUFG:     3%

Two AXI buses and DMA take away some space (I think around 20% of the
Slice utilization). I'll try to think about other possible ways of
host-FPGA communication.

Katja
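P.S. For reference, the host side of the current scheme boils down to
roughly the following (a simplified sketch, not the actual code - the
status register address, the helper names and the DMA calls are
placeholders for whatever the real interface is, and error handling is
left out):

#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Placeholder address and helpers - the real register offset and the
 * DMA interface depend on the design; these only stand in for them. */
#define STATUS_REG_ADDR  0x43C00000u  /* software-accessible status register */
#define ALL_CORES_DONE   0xFFu

extern void dma_to_fpga(const void *src, size_t len);  /* host -> shared BRAM */
extern void dma_from_fpga(void *dst, size_t len);      /* shared BRAM -> host */

void run_batch(const void *in, void *out, size_t len)
{
	int fd = open("/dev/mem", O_RDWR | O_SYNC);
	volatile uint32_t *status = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE,
	                                 MAP_SHARED, fd, STATUS_REG_ADDR);

	/* 1. DMA data for all cores into the shared 64K BRAM */
	dma_to_fpga(in, len);

	/* 2. Cores copy their slices into local BRAMs, compute, and store
	 *    results one by one under the arbiter's control; the host just
	 *    polls until the register reads FF. */
	while ((*status & 0xFF) != ALL_CORES_DONE)
		;

	/* 3. DMA results from the shared BRAM back to the host */
	dma_from_fpga(out, len);

	munmap((void *)status, 0x1000);
	close(fd);
}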