Date: Sun, 6 Jul 2014 16:16:05 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: ZedBoard: bcrypt

On Sun, Jul 06, 2014 at 01:20:20PM +0200, Katja Malvoni wrote:
> On 6 July 2014 12:55, Solar Designer <solar@...nwall.com> wrote:
> > Maybe you can use pairs of 32-bit integer or individual 64-bit integer
> > reads in place of multiple memcpy()'s.
> 
> I'm not sure I understand. I'm using mmaped memory space to access bcrypt
> logic so if I'm not mistaken, the only way I can read data from that space
> is by copying it using memcpy(). Or is there another way to perform those
> reads?

memcpy() isn't in any way special.  You can simply load integers via
pointers.  What matters is that the variables you read from (the pointer
target type) are declared volatile, and that there's a barrier to tell
the compiler these shouldn't be loaded before you've confirmed the
computation has completed and results were sent from the other end.
An empty asm("" ::: "memory"); statement should do the trick.  As
discussed before, there may also be platform-specific caching issues,
but you don't avoid them by using memcpy() anyway.
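
For illustration only, here is a minimal sketch of loading results via a
volatile pointer with a compiler barrier after the completion check.
The register layout (a "done" flag in word 0, results starting at word
1) and the names read_results(), STATUS_WORD, and RESULT_WORD are
hypothetical, not the actual bcrypt logic's interface.  The barrier uses
GCC/Clang inline asm syntax and only constrains the compiler; it does
nothing about the platform-specific caching issues mentioned above.

```c
#include <stddef.h>
#include <stdint.h>

/* Compiler-only barrier (GCC/Clang syntax); emits no instructions. */
#define compiler_barrier() __asm__ __volatile__("" ::: "memory")

/*
 * Hypothetical layout of the mmap'ed region, assumed for this sketch:
 * word 0 is a "done" flag, words 1 and up hold the results.
 */
enum { STATUS_WORD = 0, RESULT_WORD = 1 };

static void read_results(const volatile uint32_t *regs,
    uint32_t *out, size_t nwords)
{
	size_t i;

	/* Poll until the other end reports that results are available. */
	while (regs[STATUS_WORD] == 0)
		;

	/* Keep the compiler from moving the result loads above the poll. */
	compiler_barrier();

	/* Plain 32-bit loads through the volatile pointer, no memcpy(). */
	for (i = 0; i < nwords; i++)
		out[i] = regs[RESULT_WORD + i];
}
```

Here regs would be the pointer returned by mmap(), cast to
const volatile uint32_t *.  The same pattern works with uint64_t if the
bus and the mapping support 64-bit reads.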

> > Perhaps you can achieve a higher clock rate by introducing an extra
> > cycle of latency per data transfer during initialization and maybe
> > during transfer of hashes to host as well?  Anyway, maybe it's better
> > to consider that after you've implemented initialization within each
> > core as I suggested.  It depends on which of these things you feel
> > you'll have ready before your trip to the US.
> 
> I'm not sure that would help. Routing delay is 90.4% of longest path delay
> and I can't use any frequency, just ones that can be derived from PS. So
> the next one after 71.4 MHz is 76.9 MHz. With 90% of delay being routing I
> don't think it is possible to improve logic to achieve 76.9 MHz.

I don't understand your line of thinking here.  It feels wrong to me.

> All these
> wires are connected to the same AXI bus and distributed along entire FPGA
> since AXI bus must access BRAMs and every bcrypt instance must access the
> same BRAMs. In this case, those extra registers need to be on the BRAM
> inputs and outputs which directly impacts bcrypt computation, namely delays
> when loading data from S-boxes.

I think you could have separate flip-flops to introduce a cycle of
latency just for the initialization, but this would consume resources.
It may in fact be better to have that extra cycle for all BRAM accesses,
and use the approach I suggested in April, where you first implement
1 cycle/round and then turn that into 2 cycles/round with doubling of
bcrypt instance count per core (as well as total), so you get that extra
cycle of latency from there for free and achieve a higher clock rate.
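
To make the arithmetic explicit, here is a rough sketch under a
simplified model where a core's throughput is clock rate times instance
count divided by cycles per round.  The function name is made up for
this example, and plugging in 76.9 MHz is purely illustrative (it is
simply the next PS-derived frequency mentioned above, not a measured
result).

```c
/*
 * Simplified throughput model (an assumption of this sketch):
 * Blowfish rounds per second delivered by one core.
 */
static double rounds_per_second(double clock_hz, unsigned int instances,
    unsigned int cycles_per_round)
{
	return clock_hz * instances / cycles_per_round;
}
```

With these numbers, rounds_per_second(71.4e6, 1, 1) models the current
1 cycle/round design, and rounds_per_second(76.9e6, 2, 2) models
2 cycles/round with twice the instances.  The doubled instance count
cancels the doubled cycle count, so any clock rate gained from the extra
cycle of latency is a net gain in this model.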

Alexander
