Date: Sat, 13 Jul 2013 23:09:47 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: Parallella: bcrypt

Katja,

On Sat, Jul 13, 2013 at 08:41:24PM +0200, Katja Malvoni wrote:
> My assembly code doesn't produce correct results, but it has the same
> execution time (20.7 ms for all cores at cost 5) as the version that
> produces correct results, and it uses the add instruction.  When using
> fadd/iadd, the execution time is 22.6 ms.

Oh, these times are much better than what you reported before.  IIRC,
previously we were unable to get much below 30 ms - and now you're
saying you're down to 20 ms already, which means almost 50 c/s per
core.  Your 20.7 ms should give us 773 c/s on a 16-core Epiphany, which
is roughly the speed of one modern x86 CPU core - but at much lower
power usage.  The 64-core Epiphany is thus comparable to a Core 2 Quad
from a few years ago - but again, at much lower power usage.
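
To spell out the arithmetic behind those figures: 1000 / 20.7 = ~48 c/s
per core, and 16 * 48.3 = ~773 c/s for the full 16-core chip (assuming,
as your timing suggests, that each core computes one hash per 20.7 ms).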

You just need to turn this into a working JtR format now, and do so in
a way that lets the implementation scale to at least 64 cores.

> Other than that, I don't see any way to optimize BF_ROUND - the C code
> is close to optimal, almost every line corresponds to one instruction
> (the two that don't are "tmp3 += BF_INDEX(ctx->s.S[0], tmp4);" and
> "R ^= ctx->s.P[N + 1];").
> At higher costs, my code is much slower (2403.999 ms vs 2182.041 ms
> for cost 12, and 9603.889 ms vs 8716.686 ms for cost 14) - I didn't
> take the pipeline structure and hazards into account, and I don't
> think I'll be able to order the instructions better than the compiler
> did.

Using these numbers, I get:

2404/20.7/128 = 91% efficiency
8717/20.7/512 = 82% efficiency
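
(The 128 and 512 divisors are 2^(12-5) and 2^(14-5), the ratios of the
bcrypt iteration counts at costs 12 and 14 relative to cost 5.  The
high-cost time divided by that ratio estimates the inner-loop-only
portion of the 20.7 ms cost 5 run, and dividing by 20.7 ms expresses it
as a fraction of the total.)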

These are somewhat inconsistent - I would have expected to see almost
the same efficiency percentage from these two, since cost 12 is already
high enough.

In this context, by efficiency I mean CPU time spent in bcrypt's inner
loop vs. total CPU time spent.  (Of course, there are more aspects as
well - such as the inner loop itself still allowing for optimizations -
but the purpose of this "high cost" test was to measure this aspect.)

> Yaniv, Epiphany Architecture Reference, p. 69: "The branch prediction
> mechanism used by the CPU assumes that the branch was not taken. There
> is no penalty for branches not taken. For branches that are taken,
> there is a three-cycle constant penalty." - is this true for loops as
> well?  If so, in the case of BF_encrypt this means a penalty of
> (3 cycles * 1042) * 2^cost.  Can that be avoided somehow?

While Yaniv correctly pointed out the hardware loop feature, I suggest
that we don't bother.  We're able to unroll the loop enough that this
penalty is negligible.
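
For reference, here's roughly what I have in mind, as a minimal C
sketch - BF_encrypt_unrolled and the S/P parameter layout are names I
made up for illustration, and the round body just mirrors the
statements you quoted, so the real code may differ in details:

#include <stdint.h>

/* One Blowfish round, following the statements quoted in this thread
 * (four S-box lookups, add/xor mixing, R ^= P[N + 1]). */
#define BF_ROUND(S, P, L, R, N) do { \
	uint32_t t1 = (L) & 0xFF; \
	uint32_t t2 = ((L) >> 8) & 0xFF; \
	uint32_t t3 = ((L) >> 16) & 0xFF; \
	uint32_t t4 = (L) >> 24; \
	t1 = (S)[3][t1]; \
	t2 = (S)[2][t2]; \
	t3 = (S)[1][t3]; \
	t3 += (S)[0][t4]; \
	t3 ^= t2; \
	(R) ^= (P)[(N) + 1]; \
	t3 += t1; \
	(R) ^= t3; \
} while (0)

/* Fully unrolled single-block encryption: with all 16 rounds written
 * out, there is no taken branch inside the function at all, so the
 * 3-cycle taken-branch penalty is paid at most once per call in the
 * caller's loop rather than once per round. */
static void BF_encrypt_unrolled(uint32_t S[4][256], uint32_t P[18],
    uint32_t *Lp, uint32_t *Rp)
{
	uint32_t L = *Lp, R = *Rp, tmp;

	L ^= P[0];
	BF_ROUND(S, P, L, R, 0);
	BF_ROUND(S, P, R, L, 1);
	BF_ROUND(S, P, L, R, 2);
	BF_ROUND(S, P, R, L, 3);
	BF_ROUND(S, P, L, R, 4);
	BF_ROUND(S, P, R, L, 5);
	BF_ROUND(S, P, L, R, 6);
	BF_ROUND(S, P, R, L, 7);
	BF_ROUND(S, P, L, R, 8);
	BF_ROUND(S, P, R, L, 9);
	BF_ROUND(S, P, L, R, 10);
	BF_ROUND(S, P, R, L, 11);
	BF_ROUND(S, P, L, R, 12);
	BF_ROUND(S, P, R, L, 13);
	BF_ROUND(S, P, L, R, 14);
	BF_ROUND(S, P, R, L, 15);

	tmp = R;
	R = L;
	L = tmp ^ P[17];

	*Lp = L;
	*Rp = R;
}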

> I'll switch to other approach - interleaving two instances of bcrypt and
> I'll try to use Dual-Issue Scheduling Rules.

OK.
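
If it helps, here is the rough shape I'd expect that to take, as a
hedged C sketch - BF_encrypt_x2, BF_F and the a/b naming are mine, for
illustration only, and whether the compiler (or your hand-written
assembly) actually pairs the adjacent independent instructions under
the dual-issue rules is exactly what you'd be measuring:

#include <stdint.h>

/* The Blowfish F function, same mixing as in the per-round statements
 * quoted earlier in this thread. */
static inline uint32_t BF_F(uint32_t S[4][256], uint32_t x)
{
	uint32_t t = S[0][x >> 24] + S[1][(x >> 16) & 0xFF];
	t ^= S[2][(x >> 8) & 0xFF];
	return t + S[3][x & 0xFF];
}

/* Two independent bcrypt (Blowfish) instances, a and b, encrypted in
 * lockstep: adjacent statements come from independent dependency
 * chains, giving the scheduler a chance to dual-issue them.  The round
 * loop is kept here for brevity; a real version would unroll it too,
 * as discussed above. */
static void BF_encrypt_x2(uint32_t Sa[4][256], uint32_t Pa[18],
    uint32_t *La, uint32_t *Ra,
    uint32_t Sb[4][256], uint32_t Pb[18],
    uint32_t *Lb, uint32_t *Rb)
{
	uint32_t la = *La ^ Pa[0], ra = *Ra;
	uint32_t lb = *Lb ^ Pb[0], rb = *Rb;
	int i;

	for (i = 0; i < 16; i += 2) {
		ra ^= Pa[i + 1];	/* instance a, even round */
		rb ^= Pb[i + 1];	/* instance b, even round */
		ra ^= BF_F(Sa, la);
		rb ^= BF_F(Sb, lb);
		la ^= Pa[i + 2];	/* instance a, odd round */
		lb ^= Pb[i + 2];	/* instance b, odd round */
		la ^= BF_F(Sa, ra);
		lb ^= BF_F(Sb, rb);
	}

	*La = ra ^ Pa[17];
	*Ra = la;
	*Lb = rb ^ Pb[17];
	*Rb = lb;
}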

Do you feel your work on getting JtR integration working correctly is
still pending Yaniv's help?

Alexander
