Date: Sun, 21 Jul 2013 20:01:49 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: Parallella: bcrypt

Hi Katja,

On Sun, Jul 21, 2013 at 03:40:30PM +0200, Katja Malvoni wrote:
> Since the Epiphany code became much smaller when I integrated it with JtR,
> I tried using internal.ldf instead of fast.ldf, and the code fits in local
> memory.

How do internal.ldf and fast.ldf differ?

> Speed is 932 c/s (sometimes it's 934 c/s).

So that's a speedup by more than 10% from 830 c/s that you had before.
Where does this speedup come from?

> The BF_fmt port was slow because I didn't copy the salt into local
> memory; when I did that, the speed was 790 c/s.  Then I used
> internal.ldf and got a speed of 932 c/s.

Do you think the speedup is from basing your code on JtR's BF_std now,
as opposed to the size-optimized revision of the code in musl?  I doubt
that this alone would make a 10% difference, considering that we had
already unrolled the Blowfish rounds loop (and tried unrolling some
others as well) in musl's code.  There must be something else.

Are you able to use internal.ldf on the musl-derived code?  What speed
are you getting?

> If I try to interleave two
> instances of bcrypt, then the code can't fit in local memory.

You may reduce the amount of unrolling to make it fit.
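
For example, something along these lines - where bf_ctx and BF_ROUND are
simplified stand-ins I made up for illustration, not the actual BF_std
data layout - keeping 4 rounds per loop iteration instead of all 16
unrolled:

#include <stdint.h>

/* Simplified stand-ins for BF_std's key schedule layout and round macro */
typedef struct {
	uint32_t S[4][0x100];		/* S-boxes */
	uint32_t P[16 + 2];		/* subkeys */
} bf_ctx;

#define BF_ROUND(c, A, B, N) do { \
	uint32_t t = (c)->S[0][(A) >> 24]; \
	t += (c)->S[1][((A) >> 16) & 0xFF]; \
	t ^= (c)->S[2][((A) >> 8) & 0xFF]; \
	t += (c)->S[3][(A) & 0xFF]; \
	(B) ^= t ^ (c)->P[(N) + 1]; \
} while (0)

/* 16 Blowfish rounds unrolled by 4 instead of fully, to save code size */
static void BF_encrypt_unroll4(const bf_ctx *c, uint32_t *Lp, uint32_t *Rp)
{
	uint32_t L = *Lp, R = *Rp;
	int i;

	L ^= c->P[0];
	for (i = 0; i < 16; i += 4) {
		BF_ROUND(c, L, R, i);
		BF_ROUND(c, R, L, i + 1);
		BF_ROUND(c, L, R, i + 2);
		BF_ROUND(c, R, L, i + 3);
	}
	*Lp = R ^ c->P[17];
	*Rp = L;
}

This should shrink the rounds code to roughly a quarter of the fully
unrolled version at a small speed cost; whether that's a good trade-off
on Epiphany is something to measure.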

> At the moment,
> interleaving two instances of bcrypt doesn't work; it fails the self test on
> get_hash[0](1).

How are you testing it?  Does it run from off-chip memory in this test?

> Should I pursue this approach further or not?

I am not familiar enough with Epiphany instruction latencies and
dual-issue rules.  We'd need to read the Epiphany Architecture Reference
more closely in order to be able to estimate the possible benefits (or
lack thereof) from bringing more parallelism down to the instruction level.
In fact, it _might_ be the case that more parallelism is beneficial when
we use IADD and is not needed when we don't.  Or, as an alternative to
reading and estimating, we can just try - which is what you started
doing.  Combining both approaches (theoretical understanding and
practical testing) may work best.

And we should use IADD.  Your poor results with it so far merely mean
that you haven't tried hard enough to optimize that version of code.
So maybe that's what you should start with.
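
To make the interleaving idea concrete, here's roughly what I mean at
the C level, reusing the illustrative bf_ctx/BF_ROUND stand-ins from the
snippet above (the real thing would of course be based on BF_std):

/* Two independent bcrypt (Blowfish) instances computed in lockstep */
static void BF_encrypt_2x(const bf_ctx *c0, uint32_t *L0, uint32_t *R0,
    const bf_ctx *c1, uint32_t *L1, uint32_t *R1)
{
	uint32_t la = *L0 ^ c0->P[0], ra = *R0;
	uint32_t lb = *L1 ^ c1->P[0], rb = *R1;
	int i;

	for (i = 0; i < 16; i += 2) {
		BF_ROUND(c0, la, ra, i);	/* instance 0, round i */
		BF_ROUND(c1, lb, rb, i);	/* instance 1, round i */
		BF_ROUND(c0, ra, la, i + 1);	/* instance 0, round i + 1 */
		BF_ROUND(c1, rb, lb, i + 1);	/* instance 1, round i + 1 */
	}
	*L0 = ra ^ c0->P[17];
	*R0 = la;
	*L1 = rb ^ c1->P[17];
	*R1 = lb;
}

The two streams share no data, so the compiler - or a hand-scheduled
assembly version - gets pairs of independent instructions to dual-issue,
e.g. an IALU xor from one instance next to an FPU IADD from the other.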

Your current 932 c/s on 16 cores translates to 58 c/s per core, and
that's at 600 MHz.  On ARM, also with gcc-generated code, we're getting
84 c/s on one core (or about 166 c/s on two cores with OpenMP), but it's
clocked at 666 MHz, so it would be 75 c/s at 600 MHz.  The difference
between 58 c/s and 75 c/s is attributable to a combination of:

1. Worse instruction issue rate on Epiphany as compared to ARM.  This is
in part related to Epiphany being able to dual-issue fewer combinations
of instruction types - e.g., there's no XOR on the FPU - and in part to
our non-use of IADD.

2. Various out-of-inner-loop code inefficiencies, such as the memcpy().
We may deal with some of those by optimizing the code.

3. Communication overhead.

4. Scaling to higher core count, where the performance scaling may be
slightly worse than linear.  (Perhaps on 16-core ARM we would also see a
cumulative speed number slightly smaller than 16*84 c/s - but only very
slightly smaller.)

We might be able to reduce the communication overhead and improve the
scaling by implementing async transfers - perhaps have two inputs and
two outputs for each core, so that it'd proceed to compute the other
instance of bcrypt while the already computed result is being read by the
host and the new input is provided by the host.  (If we also implement
interleaving, then it'd be four inputs and four outputs per core.)
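
On the device side, this could look roughly like the sketch below.  The
mailbox layout, the flag protocol, and bf_crypt_one() are made up for
illustration; the actual sizes and the host-side counterpart would need
to match whatever transfer mechanism we settle on:

#include <stdint.h>

#define SLOTS 2			/* 4 if we also interleave two instances */

struct slot {
	volatile uint32_t ready;	/* host sets this once in[] is valid */
	volatile uint32_t done;		/* core sets this once out[] is valid */
	uint32_t in[18];		/* key/setting data (size made up) */
	uint32_t out[6];		/* bcrypt result words (size illustrative) */
};

static struct slot mailbox[SLOTS];	/* lives in core-local memory */

/* hypothetical: compute one bcrypt hash entirely out of local memory */
extern void bf_crypt_one(const uint32_t *in, uint32_t *out);

void core_main(void)
{
	unsigned int i = 0;

	for (;;) {
		struct slot *s = &mailbox[i];

		while (!s->ready)
			;		/* wait for the host to fill this slot */
		s->ready = 0;

		/* while we compute this slot, the host can read the other
		   slot's result and write its next input */
		bf_crypt_one(s->in, s->out);

		s->done = 1;
		i ^= 1;			/* switch to the other slot */
	}
}

The host would then poll done on one slot, read the result, write new
input, and set ready again, all while the core is busy with the other
slot.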

Here's another data point: the original Pentium at 120 MHz was capable
of 20.4 c/s at bcrypt (at this same bcrypt cost setting) with hand-tuned
assembly code.  It was also a dual-issue CPU (although its integer ALUs
were both fully capable, unlike Epiphany's).  At 600 MHz, this would
mean 102 c/s.

Here's yet another data point: PA-RISC 7100LC at 80 MHz (as found in HP
9000 712/80 workstation) did 16.3 c/s (with compiler-generated code,
although this figure is the highest of those produced by different code
revisions and different compilers).  It was also a dual-issue CPU (a
fully capable RISC one).  At 600 MHz, this would mean 122 c/s.  We're
only getting half that... so is Epiphany only half as capable at 32-bit
integer computation as a PA-RISC from 20 years ago (setting aside the
clock rate and power usage differences for a moment)?  I think it's
definitely less capable (largely due to its focus on floating-point and
many-core, although its integer performance can be remedied in future
revisions), but perhaps by less than a factor of 2.

So I don't buy it when you say that BF_ROUND and BF_encrypt are already
optimal.  Surely they look that way at first, and the compiler does a
pretty good job - it's just that there ought to be ways to do better.
Usage of IADD is one of those.  It might also be possible to replace
some left shift and addition (used for address calculation) with IMADD.
If all additions performed for address calculation are currently
embedded in load instructions, then bundling them with shifts instead
might not help... or it might, in some cases: by changing which
execution units remain available for other use at which times (IMADD is
on the FPU, whereas shifts are not, so we're freeing up an execution
unit for other use by using IMADD).  There's more to explore here than
what you see at first.
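
As a rough illustration of the IMADD idea - whether gcc can be made to
emit IMADD for this on its own, or whether it takes inline asm, is
something to check in the generated assembly:

#include <stdint.h>

/*
 * One S-box lookup with the address math written as base + idx * 4, i.e.
 * a multiply-add that could map onto IMADD (with the FPU in integer
 * mode) instead of a shift on the IALU feeding an add.  If gcc won't do
 * it on its own, inline asm along these lines might (untested, operand
 * order to be double-checked against the Architecture Reference):
 *
 *	__asm__("imadd %0, %1, %2" : "+r" (addr) : "r" (idx), "r" (four));
 */
static inline uint32_t sbox_lookup(const uint32_t *S, uint32_t x)
{
	uint32_t idx = (x >> 16) & 0xFF;	/* byte extract stays on the IALU */
	uintptr_t addr = (uintptr_t)S + idx * sizeof(uint32_t);

	return *(const uint32_t *)addr;
}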

Thanks,

Alexander
