Date: Mon, 14 Apr 2014 17:39:32 +0200
From: Katja Malvoni <kmalvoni@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: ZedBoard: bcrypt

On 14 April 2014 13:53, Solar Designer <solar@...nwall.com> wrote:
> It makes sense to start by running some more benchmarks, though: what
> speeds are you getting for 1 core (in FPGA), for the 2-cycle and
> 4-cycle versions?  What speeds are you getting for $2a$08 (reduces
> relative cost of host's computation by a factor of 8 compared to
> $2a$05)?

For 1 core, the 2-cycle version does 81.4 c/s and the 4-cycle version
42.9 c/s (no computation overlap).

> > The test fails on my ZedBoard.
>
> Do you mean the 2-cycle version doesn't work on your ZedBoard at all,
> even for reduced core count?

I haven't tried a reduced core count yet - it takes too much time to
generate a bitstream for ~60 cores, and I needed the Xilinx tools for
other things (I'll try it later today).  All 70 cores do the
computation, but the host reads only the relevant data (i.e., only the
first core), and this test failed on cmp_all.

> What about your Parallella board?

Somewhat worse than the ZedBoard - fewer cores work.  On the ZedBoard,
when cracking pw-bcrypt-2salt-50k, around 48000 hashes are cracked,
while on the Parallella the number is lower, ~46500.  On the Zed
system, all of them are cracked.

> > For the 4-cycle version, $2a$08 speed is 425 c/s and $2a$12 speed is
> > 32 c/s (tested on my ZedBoard, self test passes but not all
> > instances return correct results).
>
> These are good speeds.  425 c/s at $2a$08 would be 3400 c/s at $2a$05
> if we fully avoided the extra overhead.  32 c/s at $2a$12 would be
> 4096 c/s at $2a$05.
>
> Is the 2-cycle version twice faster than that?  If so, it already
> outperforms quad-core x86 CPUs at these higher cost settings.

I apologize - those numbers are for the 2-cycle version tested on my
ZedBoard.  But things are better on the Zed system: 38.59 c/s for
cost 12 and 436.8 c/s for cost 8.
> > cycle 0: compute tmp; initiate 2 S-box lookups
> > cycle 1: compute new R, L; initiate 2 S-box lookups; initiate P-box
> > lookup
>
> With two dual-port BRAM blocks per core, you're able to initiate a
> total of 8 reads per 2 clock cycles.  You're making use of 5 out of 8.
> Are the remaining ports already in use for initialization?  In other
> words, if you make more use of them, would you incur extra MUXes?

No, the remaining ports are unused; I don't have anything to use them
for.  I need only one read from the "tiny" BRAM (the P-box read).  I'm
not sure about extra MUXes, but probably not.  Port A already has MUXes
because it's used by the host to write/read data, but port B is used
only by bcrypt, so I should be able to use it again without a
noticeable area penalty.

> I think it might make sense to interleave multiple instances of bcrypt
> per core until you're making full use of all BRAM ports for
> computation.

I agree.  I'll start working on that.

> > I'm also having problems with gcc crashes (every time it crashes on
> > a different file):
> > "/usr/lib/gcc/arm-linux-gnueabihf/4.6/include/arm_neon.h:7348:1: internal
> > compiler error: Bus error
> > Please submit a full bug report,
> > with preprocessed source if appropriate.
> > See <file:///usr/share/doc/gcc-4.6/README.Bugs> for instructions.
> > The bug is not reproducible, so it is likely a hardware or OS
> > problem."
>
> Do the gcc crashes happen only when you have your custom bitstream
> loaded?  I suspect that it could be drawing more current even when not
> in use (its clock is running anyway, right?), thereby lowering the
> core voltage for the CPU even when you're merely compiling stuff (and
> not having the CPU make use of the PL).

I don't recall seeing them when using the default bitstream.  With the
4-cycle, 60-core bitstream this wasn't happening, but with 67+ cores it
happens very often.

Katja
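(Context for the 5-of-8 port count: each standard Blowfish round in bcrypt's core loop performs four S-box reads and one P-box read. A sketch of that round in C - this is the textbook Blowfish round, not Katja's Verilog; the comments map it onto the 2-cycle schedule quoted above:)

```c
#include <stdint.h>

/* Blowfish S-boxes and P-array; contents come from key setup and are
 * zeroed here only so the sketch is self-contained. */
static uint32_t S[4][256];
static uint32_t P[18];

/* One Blowfish round: one P-box read plus four S-box reads, i.e. the
 * five BRAM accesses per round discussed above. */
static void blowfish_round(uint32_t *L, uint32_t *R, int i)
{
    *L ^= P[i];                                   /* P-box lookup */
    uint32_t x = *L;
    /* cycle 0: two S-box lookups feed the intermediate "tmp" */
    uint32_t tmp = S[0][x >> 24] + S[1][(x >> 16) & 0xff];
    /* cycle 1: two more S-box lookups, then compute new R, L */
    tmp = (tmp ^ S[2][(x >> 8) & 0xff]) + S[3][x & 0xff];
    *R ^= tmp;
    uint32_t t = *L;                              /* swap halves */
    *L = *R;
    *R = t;
}
```

With two dual-port BRAMs (4 ports) and 2 cycles per round, 8 reads are possible but only these 5 are needed, which is why interleaving a second bcrypt instance per core can soak up the idle ports.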