john-dev - Re: ZedBoard: bcrypt

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20140414115350.GA27336@openwall.com>
Date: Mon, 14 Apr 2014 15:53:51 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: ZedBoard: bcrypt

On Mon, Apr 14, 2014 at 12:28:09PM +0200, Katja Malvoni wrote:
> On 13 April 2014 10:23, Solar Designer <solar@...nwall.com> wrote:
> > 70 cores would be doing
> > something like 1895*70/67 = 1980 c/s.  At 2 cycles/round, the speed
> > should be almost twice that, but you're reporting "only" 2162 c/s.  Why
> > is that?
> 
> The problem is computation on host. Lines 649 to 681 in
> https://github.com/kmalvoni/JohnTheRipper/blob/master/src/BF_std.c#L649take
> 0.016221s to compute while FPGA computation takes 0.011521s. Another
> problem are data transfers to/from FPGA. Each transfer takes
> around 0.006414s (these numbers are for cost 5). First all the data gets
> ready and transferred to FPGA and after that computation is started. When
> all 70 cores finish, data is transferred back to host.
> I should change that so that cores start computing as soon as data is ready.

Yes, at current speeds it became more important to increase the overlap
between host and FPGA computation.

> > It makes sense to start by running some more benchmarks, though: what
> > speeds are you getting for 1 core (in FPGA), for the 2-cycle and 4-cycle
> > versions?  What speeds are you getting for $2a$08 (reduces relative
> > cost of host's computation by a factor of 8 compared to $2a$05)?
> 
> I can't test speed for 1 core at the moment - I'm not able to ssh to the
> Zed system,

It was locked up.  Looking at the LEDs, I saw it no longer had one of
your custom bitstreams loaded - maybe it got rebooted but didn't fully
start up somehow, or maybe you deliberately loaded a Parallella
bitstream back and that somehow failed.

I've just power-cycled it, and it should be back up.

> the error I get is connection refused.

The "connection refused" came from another machine that acts as a gateway.

> The test fails on my ZedBoard.

Do you mean the 2-cycle version doesn't work on your ZedBoard at all,
even for reduced core count?

What about your Parallella board?

> For 4-cycle version, $2a$08 speed is 425 c/s and $2a$12 speed is 32 c/s
> (tested on my ZedBoard, self test passes but not all instances return
> correct results).

These are good speeds.  425 c/s at $2a$08 would be 3400 c/s at $2a$05 if
we fully avoided the extra overhead.  32 c/s at $2a$12 would be 4096 c/s
at $2a$05.

Is the 2-cycle version twice faster than that?  If so, it already
outperforms quad-core x86 CPUs at these higher cost settings.

> cycle 0: compute tmp; initiate 2 S-box lookups
> cycle 1: compute new R, L; initiate 2 S-box lookups; initiate P-box lookup

With two dual-port BRAM blocks per core, you're able to initiate a total
of 8 reads per 2 clock cycles.  You're making use of 5 out of 8.  Are
the remaining ports already in use for initialization?  In other words,
if you make more use of them, would you incur extra MUXes?

I think it might make sense to interleave multiple instances of bcrypt
per core until you're making full use of all BRAM ports for computation.

With 4 bcrypt instances per core, you need 20 reads per round.  With 2
cycles/round, that's 10 reads per cycle, needing 5 BRAMs.  Maybe you can
have:

Cycle 0:
initiate S0, S1 lookups for instances 0, 1 (total: 4 lookups)
initiate S2, S3 lookups for instances 2, 3 (total: 4 lookups)
initiate P lookups for instances 0, 1 (total: 2 lookups)
(total: 10 lookups)
Cycle 1:
initiate S2, S3 lookups for instances 0, 1 (total: 4 lookups)
initiate S0, S1 lookups for instances 2, 3 (total: 4 lookups)
initiate P lookups for instances 2, 3 (total: 2 lookups)
(total: 10 lookups)

with the computation also spread across the two cycles as appropriate
(and maybe you can reuse the same 32-bit adders across bcrypt instances,
although the cost of extra MUXes is likely to kill the advantage).

Expanding this to 3 cycles/round and 6 instances/core also makes sense,
to allow for higher clock rate: not requiring the data to be available
on the next clock cycle, but only 1 cycle later.  I recall reading that
Xilinx BRAMs support output registers for that.

It'd be fine to proceed with these additional optimizations after moving
to ztex.  (Perhaps the optimizations can then be backported to the Zynq
on ZedBoard platform, just to have "final" speed figures for it.)

> > Are you still getting correct results on my ZedBoard only, but not on
> > yours (needing a lower core count for yours)?  And not on Parallella
> > board either?  I suspect the limited power / core voltage drop issue.
> > At 1.0 V core voltage, even a (peak) power usage of just 1.0 W means a
> > current of 1.0 A, so if e.g. a PCB trace has impedance of 0.1 Ohm (I
> > think this is too high, but not unrealistic) we might have a voltage
> > drop of 0.1 V right there, and that's 10% of total.  That's not even
> > considering limitations of the voltage regulator.  (I am assuming that
> > there's no voltage sense going back from the FPGA to the voltage
> > regulator.  I think there is not.)
> 
> That's correct. I'm not getting correct results on my boards. I've tried
> using 12V/8A PSU instead of 12V/3A on ZedBoard but that didn't help.

It was good to try this, but it's no surprise it didn't help: we suspect
the problem is with lower voltage, higher current circuitry on the PCB.

> I'm
> also having problems with gcc crashes (every time it crashes on different
> file):
> "/usr/lib/gcc/arm-linux-gnueabihf/4.6/include/arm_neon.h:7348:1: internal
> compiler error: Bus error
> Please submit a full bug report,
> with preprocessed source if appropriate.
> See <file:///usr/share/doc/gcc-4.6/README.Bugs> for instructions.
> The bug is not reproducible, so it is likely a hardware or OS problem."

Do the gcc crashes happen only when you have your custom bitstream
loaded?  I suspect that it could be drawing more current even when not
in use (its clock is running anyway, right?), thereby lowering the core
voltage for the CPU even when you're merely compiling stuff (and not
having the CPU make use of the PL).

> > As discussed off-list, I think you should also proceed with ztex board.
> > You mentioned that the documentation wasn't of sufficient help for you
> > to get communication going, right?  If so, suggest that you work
> > primarily from working code examples, such as those for Bitcoin and
> > Litecoin mining, as well as with the vendor's SDK examples.
> 
> Somehow I missed link to the EZ-USB FX2 Technical Reference Manual. I found
> some answers there and I hope to find other answers in the code examples.

Sounds good.

Alexander
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.