Date: Sun, 13 Apr 2014 12:23:11 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: ZedBoard: bcrypt

Hi Katja,

Thank you for bringing this back to the list!

On Fri, Apr 11, 2014 at 06:54:39PM +0200, Katja Malvoni wrote:
> Here are some better news about this.

Great!

> Redesigning PS-PL communication resulted in improvement. I have working
> design with 70 bcrypt cores. Performance is 2162 c/s on 71 MHz frequency. 2
> cycles are needed for one Blowfish round. Computation on host is overlapped
> with computation on FPGA 5/6th of the time.

Off-list, you had reported 67 cores at 71 MHz doing 1895 c/s when you
had 4 cycles/round, or so I understood you (possibly incorrectly since
this data was split across multiple e-mails).  70 cores would be doing
something like 1895*70/67 = 1980 c/s.  At 2 cycles/round, the speed
should be almost twice that, but you're reporting "only" 2162 c/s.  Why
is that?

Are we possibly bumping into computation on the host, despite the
5/6th overlap, now that you've halved the cycles per round?  If I'm not
mistaken, at $2a$05 the host's computation is around 1.8% of the
FPGA's:

(512+64)/(512*2*32) = 0.01758

This corresponds to required minimum host speed to avoid it being the
bottleneck at:

2162*0.01758 = 38 c/s

We were actually getting 84 c/s with (semi-)optimized code on one of
these ARM cores.  That is more than twice the required minimum, yet I
think we may be close enough that, combined with maybe less optimal
code, communication overhead, and less than perfect overlapping, this
could explain why most of the expected speed improvement from going
from 4 to 2 cycles/round is missing.
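
In case it's useful, here is the same back-of-the-envelope estimate as
a tiny C program.  The (512+64) vs. 512*2*2^cost split is just my
approximation from the numbers above, so treat it as such:

#include <stdio.h>

int main(void)
{
        /* Measured FPGA speed at $2a$05 with 70 cores; at higher cost
           the FPGA's own c/s would drop too, so the host requirement
           falls even further than shown here. */
        const double fpga_cps = 2162.0;
        int cost;

        for (cost = 5; cost <= 8; cost++) {
                double host_frac =
                    (512.0 + 64.0) / (512.0 * 2 * (1 << cost));
                printf("$2a$%02d: host %.5f -> min %.1f c/s\n",
                    cost, host_frac, fpga_cps * host_frac);
        }

        return 0;
}

For $2a$05 this reproduces the ~38 c/s figure, and the $2a$08 line is
the factor-of-8 reduction mentioned below.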

It makes sense to start by running some more benchmarks, though: what
speeds are you getting for 1 core (in FPGA), for the 2-cycle and 4-cycle
versions?  What speeds are you getting for $2a$08 (which reduces the
relative cost of the host's computation by a factor of 8 compared to
$2a$05)?

Once you've run the benchmarks above, you might want to try adding
OpenMP to use the second ARM core.
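
Just to illustrate what I mean (a minimal sketch only - the function
names below are made up, not anything from your tree):

#include <stdio.h>

/* Stand-in for whatever per-candidate work the host currently does
   before/after handing off to the FPGA - purely illustrative. */
static void host_precompute(int index)
{
        printf("preparing candidate %d\n", index);
}

void host_process_batch(int count)
{
        int i;

/* Spread the per-candidate host work across both Cortex-A9 cores. */
#pragma omp parallel for
        for (i = 0; i < count; i++)
                host_precompute(i);
}

int main(void)
{
        host_process_batch(8);
        return 0;
}

Built with -fopenmp, the loop iterations get spread across both cores;
with the ~84 c/s we saw per core, that should roughly double the host
speed available, assuming the second core is otherwise mostly idle.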

You might also want to try implementing almost the entire bcrypt in
FPGA, although chances are that at least initially this will result in
much bigger cores, so fewer will fit.  This is unlikely to be a good
idea as long as we're able to provide enough CPU power to cover the
roughly 2% of the total processing that falls on the host.  Yet it
could be worth trying at some point.

> Utilization is:
> Number of Slice Registers:     11,849 out of 106,400   11%
> Number of Slice LUTs:          44,811 out of  53,200   84%
> Number of occupied Slices:     12,914 out of  13,300   97%
> Number of RAMB36E1/FIFO36E1s:     140 out of     140  100%
> Number of BUFG/BUFGCTRLs:           2 out of      32    6%
> 
> I can't fit more than 70 cores; BRAM is the limiting resource. If I
> don't store P, the expanded key, salt and cost in BRAM, I have to
> store them in distributed RAM in order to keep the communication the
> way it is now. I can't use the AXI4 bus to store something in a
> register; it has to be a memory with an address bus, data-in and
> data-out buses and a write enable signal (actually, when I implement
> it such that it uses write enable, it's synthesized as distributed
> RAM, and write enable is the only way I can tell whether the host is
> writing or reading). LUT utilization for this design was around 55%
> for 4 bcrypt cores.

Ouch.  I think there was still much room for optimization there, while
keeping those things in distributed RAM.

It might well be that spending two BRAM blocks per bcrypt core is the
optimal configuration.  Yet what about sharing a BRAM block across
multiple cores - e.g., try one shared BRAM per two cores - for the tiny
things (P, etc.)?  You have two cycles/round, so clearly you're reading
from P on only one of these two cycles.  You could have a nearby core
read its P from the same BRAM on the other cycle.  Then you'd have three
BRAMs per two cores, so 46 pairs of cores, or the equivalent of 92
current cores, would fit in terms of BRAM.  Oh, you're at 97% for
Slices, so this is unlikely to allow for a core count increase...

> Code: git clone https://github.com/kmalvoni/JohnTheRipper -b master

Can you also post a summary of what work is done on those two cycles?

Are you still getting correct results on my ZedBoard only, but not on
yours (needing a lower core count for yours)?  And not on the
Parallella board either?  I suspect the limited power / core voltage
drop issue.  At 1.0 V core voltage, even a (peak) power usage of just
1.0 W means a current of 1.0 A, so if e.g. a PCB trace has an impedance
of 0.1 Ohm (I think this is on the high side, but not unrealistic) we
might have a voltage drop of 0.1 V right there, and that's 10% of the
total.  That's not even
considering limitations of the voltage regulator.  (I am assuming that
there's no voltage sense going back from the FPGA to the voltage
regulator.  I think there is not.)
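
For what it's worth, the same arithmetic over a few plausible trace
resistances (the 1.0 W and 1.0 V figures are the same guesses as above,
and so are the resistance values):

#include <stdio.h>

int main(void)
{
        const double power = 1.0;   /* W, assumed peak core power */
        const double vcore = 1.0;   /* V, nominal core voltage */
        const double r_trace[] = { 0.02, 0.05, 0.10 }; /* Ohm, guesses */
        int i;

        for (i = 0; i < 3; i++) {
                double current = power / vcore;       /* I = P / V */
                double drop = current * r_trace[i];   /* V = I * R */
                printf("R = %.2f Ohm: drop = %.0f mV (%.0f%% of Vcore)\n",
                    r_trace[i], drop * 1000.0, 100.0 * drop / vcore);
        }

        return 0;
}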

As discussed off-list, I think you should also proceed with the ztex
board.  You mentioned that the documentation wasn't of sufficient help
for you to get communication going, right?  If so, I suggest that you
work primarily from working code examples, such as those for Bitcoin
and Litecoin mining, as well as from the vendor's SDK examples.

Overall, I am happy with the progress you're making on this project.

Thanks again,

Alexander
