Date: Sat, 13 Jul 2013 20:41:24 +0200
From: Katja Malvoni <>
Subject: Re: Parallella: bcrypt

Hi Yaniv, Alexander,

My assembly code doesn't produce correct results, but it has the same
execution time (20.7 ms for all cores at cost 5) as the one that produces
correct results, and it uses the add instruction. When using fadd/iadd, the
execution time is 22.6 ms.
Apart from that, I don't see any other way to optimize BF_ROUND: the C code
is already close to optimal, and almost every line corresponds to one
instruction (two that
don't are "tmp3 += BF_INDEX(ctx->s.S[0], tmp4);" and "R ^= ctx->s.P[N +
At higher costs, my code is about 10% slower (2403.999 ms vs 2182.041 ms
for cost 12, and 9603.889 ms vs 8716.686 ms for cost 14). I didn't take the
pipeline structure and hazards into account, and I don't think I'll be able
to order the instructions better than the compiler did.

Yaniv, Epiphany Architecture Reference, p.69: "The branch prediction
mechanism used by the CPU assumes that the branch was not taken. There is
no penalty for branches not taken. For branches that are taken, there is a
three-cycle constant penalty." Is this true for loops as well? If so, in the
case of BF_encrypt this means a penalty of (3 cycles * 1042) * 2^cost. Can
that be avoided somehow?
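
To put a number on that estimate, here is a minimal sketch in C. It only
evaluates the formula above; the function names are made up for
illustration, and the clock frequency is a parameter because the real value
depends on the board.

```c
#include <stdint.h>

/* Figures from the message above: a 3-cycle penalty per taken branch and
 * 1042 taken branches per iteration of the cost loop. */
#define TAKEN_BRANCH_PENALTY 3u
#define TAKEN_BRANCHES_PER_ITERATION 1042u

/* Hypothetical helper: total penalty cycles, (3 * 1042) * 2^cost. */
uint64_t branch_penalty_cycles(unsigned cost)
{
    return (uint64_t)TAKEN_BRANCH_PENALTY * TAKEN_BRANCHES_PER_ITERATION
           * ((uint64_t)1 << cost);
}

/* Hypothetical helper: convert cycles to milliseconds for a given core
 * clock in Hz. The clock value is an assumption supplied by the caller. */
double cycles_to_ms(uint64_t cycles, double clock_hz)
{
    return (double)cycles * 1000.0 / clock_hz;
}
```

For example, branch_penalty_cycles(12) is 12804096 cycles. If the core
clock were 600 MHz (an assumption), cycles_to_ms(12804096, 600e6) would be
roughly 21 ms.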

On Thu, Jun 27, 2013 at 5:08 PM, Solar Designer <> wrote:

> [...]
> We may try two things:
> 1. Interleave two instances of bcrypt.
> and/or
> 2. Rewrite this function in assembly.  The compiler-generated code does
> look suboptimal.
> [...]

I'll switch to the other approach, interleaving two instances of bcrypt, and
I'll try to make use of the Dual-Issue Scheduling Rules.
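
As a rough illustration of what interleaving means at the Blowfish level,
here is a sketch in C. It is not the actual Parallella code: the bf_ctx
layout and all function names are hypothetical, and the tables are filled
with arbitrary values rather than the real Blowfish constants. The idea is
only that the rounds of two independent instances alternate, so neighbouring
operations don't depend on each other and the compiler or a hand scheduler
has more freedom to pair them.

```c
#include <stdint.h>

/* Hypothetical minimal Blowfish state (illustrative layout only). */
typedef struct {
    uint32_t P[18];
    uint32_t S[4][256];
} bf_ctx;

/* Fill a context with arbitrary deterministic values. These are NOT the
 * real Blowfish constants; they only let the two code paths be compared. */
void bf_fill(bf_ctx *c, uint32_t seed)
{
    uint32_t x = seed;
    for (int i = 0; i < 18; i++) {
        x = x * 1664525u + 1013904223u;
        c->P[i] = x;
    }
    for (int b = 0; b < 4; b++)
        for (int i = 0; i < 256; i++) {
            x = x * 1664525u + 1013904223u;
            c->S[b][i] = x;
        }
}

/* Blowfish F function: ((S0[a] + S1[b]) ^ S2[c]) + S3[d]. */
uint32_t bf_f(const bf_ctx *c, uint32_t x)
{
    return ((c->S[0][x >> 24] + c->S[1][(x >> 16) & 0xFF])
            ^ c->S[2][(x >> 8) & 0xFF]) + c->S[3][x & 0xFF];
}

/* Reference: encrypt one block with one instance (16 rounds). */
void bf_encrypt1(const bf_ctx *c, uint32_t *l, uint32_t *r)
{
    uint32_t L = *l ^ c->P[0], R = *r;
    for (int i = 0; i < 16; i += 2) {
        R ^= c->P[i + 1] ^ bf_f(c, L);
        L ^= c->P[i + 2] ^ bf_f(c, R);
    }
    *l = R ^ c->P[17];
    *r = L;
}

/* Interleaved: the same rounds for two independent instances a and b,
 * alternated so that adjacent statements have no data dependency. */
void bf_encrypt2(const bf_ctx *a, const bf_ctx *b,
                 uint32_t *la, uint32_t *ra, uint32_t *lb, uint32_t *rb)
{
    uint32_t La = *la ^ a->P[0], Ra = *ra;
    uint32_t Lb = *lb ^ b->P[0], Rb = *rb;
    for (int i = 0; i < 16; i += 2) {
        Ra ^= a->P[i + 1] ^ bf_f(a, La);
        Rb ^= b->P[i + 1] ^ bf_f(b, Lb);
        La ^= a->P[i + 2] ^ bf_f(a, Ra);
        Lb ^= b->P[i + 2] ^ bf_f(b, Rb);
    }
    *la = Ra ^ a->P[17]; *ra = La;
    *lb = Rb ^ b->P[17]; *rb = Lb;
}

/* Returns 1 if the interleaved path matches two separate reference calls. */
int bf_interleave_matches(void)
{
    static bf_ctx a, b;
    uint32_t l1 = 1, r1 = 2, l2 = 3, r2 = 4; /* reference inputs */
    uint32_t la = 1, ra = 2, lb = 3, rb = 4; /* interleaved inputs */
    bf_fill(&a, 12345u);
    bf_fill(&b, 67890u);
    bf_encrypt1(&a, &l1, &r1);
    bf_encrypt1(&b, &l2, &r2);
    bf_encrypt2(&a, &b, &la, &ra, &lb, &rb);
    return la == l1 && ra == r1 && lb == l2 && rb == r2;
}
```

Whether this helps on Epiphany depends on how the resulting instructions
can be paired under the dual-issue rules; the sketch only shows that the
interleaved form computes the same results as two separate calls.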

Best regards,

