Date: Sun, 21 Jul 2013 20:01:49 +0400 From: Solar Designer <solar@...nwall.com> To: john-dev@...ts.openwall.com Subject: Re: Parallella: bcrypt Hi Katja, On Sun, Jul 21, 2013 at 03:40:30PM +0200, Katja Malvoni wrote: > Since Epiphany code became much smaller when I integrated it with JtR, I > tried using internal.ldf instead of fast.ldf and code fits in local memory. How do internal.ldf and fast.ldf differ? > Speed is 932 c/s (sometimes it's 934 c/s). So that's a speedup by more than 10% from 830 c/s that you had before. Where does this speedup come from? > BF_fmt port was slow because I > didn't copy salt in local memory, when I did that speed was 790 c/s. Than I > used internal.ldf and got speed of 932 c/s. Do you think the speedup is from basing your code on JtR's BF_std now, as opposed to the size-optimized revision of the code in musl? I doubt that this alone would make a 10% difference, considering that we had already unrolled the Blowfish rounds loop (and tried unrolling some others as well) in musl's code. There must be something else. Are you able to use internal.ldf on the musl-derived code? What speed are you getting? > If I try to interleave two > instances of bcrypt than code can't fit in local memory. You may reduce the amount of unrolling to make it fit. > At the moment, > interleaving two instances of bcrypt doesn't work, it fails self test on > get_hash(1). How are you testing it? Does it run from off-chip memory in this test? > Should I pursue this approach further or not? I am not familiar enough with Epiphany instruction latencies and dual-issue rules. We'd need to read the Epiphany Architecture Reference more closely in order to be able to estimate the possible benefits (or lack thereof) from bringing more parallellism down to instruction level. In fact, it _might_ be the case that more parallelism is beneficial when we use IADD and is not needed when we don't. Or, as an alternative to reading and estimating, we can just try - which is what you started doing. Combining both approaches (theoretical understanding and practical testing) may work best. And we should use IADD. Your poor results with it so far merely mean that you haven't tried hard enough to optimize that version of code. So maybe that's what you should start with. Your current 932 c/s on 16 cores translates to 58 c/s per core, and that's at 600 MHz. On ARM, also with gcc-generated code, we're getting 84 c/s on one core (or about 166 c/s on two cores with OpenMP), but it's clocked at 666 MHz, so it would be 75 c/s at 600 MHz. The difference between 58 c/s and 75 c/s is attributable to a combination of: 1. Worse instruction issue rate on Epiphany as compared to ARM. This is in part related to Epiphany being able to dual-issue fewer combinations of instruction types - e.g., there's no XOR on the FPU - and in part to our non-use of IADD. 2. Various out-of-inner-loop code inefficiencies, such as the memcpy(). We may deal with some of those by optimizing the code. 3. Communication overhead. 4. Scaling to higher core count, where the performance scaling may be slightly worse than linear. (Perhaps on 16-core ARM we would also see a cumulative speed number slightly smaller than 16*84 c/s - but only very slightly smaller.) We might be able to reduce the communication overhead and improve the scaling by implementing async transfers - perhaps have two inputs and two outputs for each core, so that it'd proceed to compute the other instance of bcrypt while the already computed result is being read by host and the new input is provided by host. (If we also implement interleaving, then it'd be four inputs and four outputs per core.) Here's another data point: the original Pentium at 120 MHz was capable of 20.4 c/s at bcrypt (at this same bcrypt cost setting) with hand-tuned assembly code. It was also a dual-issue CPU (although its integer ALUs were both fully capable, unlike Epiphany's). At 600 MHz, this would mean 102 c/s. Here's yet another data point: PA-RISC 7100LC at 80 MHz (as found in HP 9000 712/80 workstation) did 16.3 c/s (with compiler-generated code, although this figure is the highest of those produced by different code revisions and different compilers). It was also a dual-issue CPU (a fully capable RISC one). At 600 MHz, this would mean 122 c/s. We're only getting half that... so is Epiphany twice less capable at 32-bit integer computation than PA-RISC from 20 years ago was (not considering the clock rate and power usage difference for a moment)? I think it's definitely less capable (largely due to its focus on floating-point and many-core, although its integer performance can be remedied in future revisions), but perhaps by less than a factor of 2. So I don't buy it when you say that BF_ROUND and BF_encrypt are already optimal. Surely they look that way at first, and the compiler does a pretty good job - it's just that there ought to be ways to do better. Usage of IADD is one of those. It might also be possible to replace some left shift and addition (used for address calculation) with IMADD. If all additions performed for address calculation are currently embedded in load instructions, then bundling them with shifts instead might not help... or it might, in some cases: by changing which execution units remain available for other use at which times (IMADD is on the FPU, whereas shifts are not, so we're freeing up an execution unit for other use by using IMADD). There's more to explore here than what you see at first. Thanks, Alexander
Powered by blists - more mailing lists
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.