Date: Sat, 13 Jul 2013 20:41:24 +0200
From: Katja Malvoni <kmalvoni@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: Parallella: bcrypt

Hi Yaniv, Alexander,

My assembly code doesn't produce correct results, but it has the same
execution time (20.7 ms for all cores and cost 5) as the one that
produces correct results, and it uses the add instruction. When using
fadd/iadd, the execution time is 22.6 ms. Apart from that, I don't see
any other way to optimize BF_ROUND - the C code is already close to
optimal, almost every line corresponds to one instruction (the two that
don't are "tmp3 += BF_INDEX(ctx->s.S, tmp4);" and
"R ^= ctx->s.P[N + 1];").

At higher cost, my code is much slower (2403.999 ms vs 2182.041 ms for
cost 12 and 9603.889 ms vs 8716.686 ms for cost 14) - I didn't take
pipeline structure and hazards into account, and I don't think I'll be
able to order the instructions better than the compiler did.

Yaniv, Epiphany Architecture Reference, p.69: "The branch prediction
mechanism used by the CPU assumes that the branch was not taken. There
is no penalty for branches not taken. For branches that are taken, there
is a three-cycle constant penalty." - is this true for loops as well? If
yes, in the case of BF_encrypt this means a penalty of
(3 cycles * 1042) * 2^cost. Can that be avoided somehow?

On Thu, Jun 27, 2013 at 5:08 PM, Solar Designer <solar@...nwall.com> wrote:

> [...]
> We may try two things:
>
> 1. Interleave two instances of bcrypt.
>
> and/or
>
> 2. Rewrite this function in assembly. The compiler-generated code does
> look suboptimal.
> [...]
>

I'll switch to the other approach - interleaving two instances of bcrypt -
and I'll try to use the Dual-Issue Scheduling Rules.

Best regards,
Katja
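
The taken-branch estimate in the message, (3 cycles * 1042) * 2^cost,
can be worked out for the costs that were timed. Below is a minimal C
sketch that assumes the message's own figures (a 3-cycle penalty per
taken branch and 1042 taken branches per iteration of the 2^cost loop).
The function and macro names are hypothetical and do not come from the
John the Ripper source.

```c
#include <assert.h>
#include <stdint.h>

/* Figures taken from the message; both are assumptions of this sketch. */
#define TAKEN_BRANCH_PENALTY_CYCLES 3u     /* quoted from the Epiphany reference */
#define TAKEN_BRANCHES_PER_ITERATION 1042u /* per iteration of the 2^cost loop */

/*
 * Hypothetical helper: estimated cycles lost to taken branches for one
 * bcrypt computation at the given cost, i.e. (3 * 1042) * 2^cost.
 */
static uint64_t branch_penalty_cycles(unsigned cost)
{
    assert(cost <= 31); /* bcrypt's cost parameter does not exceed 31 */
    return (uint64_t)TAKEN_BRANCH_PENALTY_CYCLES
         * TAKEN_BRANCHES_PER_ITERATION
         * ((uint64_t)1 << cost);
}
```

Under these assumptions, branch_penalty_cycles(5) gives 100032 cycles,
cost 12 gives 12804096 cycles, and cost 14 gives 51216384 cycles per
hash. Converting these to time requires the core clock frequency, which
the message does not state.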