Date: Wed, 30 Oct 2013 17:31:48 +0100
From: Katja Malvoni <kmalvoni@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: ZedBoard: bcrypt

On Wed, Oct 30, 2013 at 4:23 PM, Solar Designer <solar@...nwall.com> wrote:

> On Wed, Oct 30, 2013 at 02:07:28PM +0100, Katja Malvoni wrote:
> > On Wed, Oct 30, 2013 at 10:17 AM, Solar Designer <solar@...nwall.com> wrote:
> > > On Tue, Oct 29, 2013 at 07:48:35PM +0100, Katja Malvoni wrote:
> > > > At the moment performance is 602 c/s, maximum frequency is 100 MHz.
> > >
> > > What has contributed to doubling the performance (since your previous
> > > report)?  I guess it could be performing the 4 S-box lookups all at
> > > once, but then you're giving high numbers of cycles per round anyway:
> >
> > That is correct; since most of the RAM is unused, I'm storing each
> > S-box twice.
>
> Now I am confused.  Why would you need to store each S-box twice in
> order to perform all four lookups at once?  I think it'd be sufficient
> for you to allocate no more than two S-boxes per BRAM (since you have at
> most two ports per BRAM, at least physical), but I don't see any need
> for data duplication.

If I use two ports of the BRAM, then I can get only two words out, and I
need 4.  So I store each S-box in two BRAM blocks in order to have all 4
values after 2 cycles of delay.  Since LUT utilization is a problem,
wasting some BRAM to get faster seemed okay.

> Also, weren't you already at 3 cycles per Blowfish round before this
> change?  If not, then how many cycles per Blowfish round did you have in
> the revision that achieved ~300 c/s using 14 cores at 100 MHz?

No, it was 5 cycles:

Cycle 0: initiate 2 S-box lookups
Cycle 1: wait
Cycle 2: initiate other 2 S-box lookups, compute tmp
Cycle 3: wait
Cycle 4: compute new L, swap L and R

> You still have 14 cores now, right?

Yes, there are still 14 cores.

> Yet another thing I found confusing is you mentioning some high offsets
> into a (logical) 64 KB BRAM.
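(For readers unfamiliar with the round structure being discussed: a minimal C sketch of one Blowfish round, showing why four independent S-box lookups are needed per round. The S-boxes and P-array below are zero-initialized placeholders, not the real bcrypt state; in the FPGA core the lookups become BRAM reads, and duplicating each S-box across two BRAM blocks makes all four values available after the two-cycle read latency.)

```c
#include <stdint.h>

static uint32_t S[4][256];  /* four 256-entry S-boxes (placeholder contents) */
static uint32_t P[18];      /* P-array of round subkeys (placeholder contents) */

/* Blowfish F function: one lookup into each of the four S-boxes,
 * indexed by the four bytes of x ("compute tmp" in the cycle list). */
static uint32_t F(uint32_t x)
{
    uint32_t a = S[0][(x >> 24) & 0xff];
    uint32_t b = S[1][(x >> 16) & 0xff];
    uint32_t c = S[2][(x >>  8) & 0xff];
    uint32_t d = S[3][ x        & 0xff];
    return ((a + b) ^ c) + d;
}

/* One round: mix in the subkey, apply F, combine into the other half,
 * then swap L and R ("compute new L, swap L and R"). */
static void blowfish_round(uint32_t *L, uint32_t *R, int i)
{
    *L ^= P[i];
    *R ^= F(*L);
    uint32_t t = *L;
    *L = *R;
    *R = t;
}
```

Since two BRAM ports yield only two words per access, the four lookups in F are what force either two sequential read pairs or the duplicated-S-box layout described above.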
> Is this some extra BRAM you're using for
> host to FPGA transfer only?

Yes, there is a 64 KB shared BRAM used for host transfers (I used 64 KB
because I thought more than 14 cores would fit).  The host puts data for
all cores into that BRAM using DMA.  Then the arbiter controls the FSM in
the cores so that only one core fetches data from the shared BRAM at a
time.  After all cores have transferred data from the shared BRAM into
their local BRAMs, computation is started.  And then the arbiter again
makes sure that the cores store data one by one.  All this time, the host
waits in a polling loop, checking a software-accessible register.  This
register is set to FF after all cores have stored their data.  At the end,
a DMA transfer from the BRAM to the host is done.

Katja
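(The host-side handshake described above can be sketched as follows. This is a hypothetical illustration only: the register name, the DONE value, and the DMA helpers are placeholders standing in for the actual ZedBoard design; here the DMA transfers are simulated with memcpy into a 64 KB buffer.)

```c
#include <stdint.h>
#include <string.h>

#define STATUS_DONE 0xFF  /* value the arbiter sets once all cores stored results */

static uint8_t bram[65536];            /* stand-in for the 64 KB shared BRAM */
static volatile uint8_t status_reg;    /* stand-in for the software-accessible register */

/* Placeholder DMA transfers, simulated with memcpy. */
static void dma_to_bram(const void *src, uint32_t len)   { memcpy(bram, src, len); }
static void dma_from_bram(void *dst, uint32_t len)       { memcpy(dst, bram, len); }

void run_bcrypt_batch(const void *in, void *out, uint32_t len)
{
    dma_to_bram(in, len);          /* 1. DMA data for all cores into shared BRAM */
                                   /* 2. arbiter lets each core fetch in turn,
                                    *    computation runs, cores store one by one */
    while (status_reg != STATUS_DONE)
        ;                          /* 3. host polls until the register reads FF */
    dma_from_bram(out, len);       /* 4. final DMA from shared BRAM back to host */
}
```

The point of the shared BRAM plus arbiter is that the host only ever performs two bulk DMA transfers per batch; serializing the per-core copies happens entirely in fabric, invisible to software.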