Date: Sun, 21 Jul 2013 19:21:27 +0200
From: Katja Malvoni <>
Subject: Re: Parallella: bcrypt

On Sun, Jul 21, 2013 at 6:01 PM, Solar Designer <> wrote:

> Hi Katja,
> On Sun, Jul 21, 2013 at 03:40:30PM +0200, Katja Malvoni wrote:
> > Since Epiphany code became much smaller when I integrated it with JtR, I
> > tried using internal.ldf instead of fast.ldf and the code fits in local
> > memory.
> How do internal.ldf and fast.ldf differ?

With internal.ldf, all code is placed in local memory; with fast.ldf, user
code and the stack are placed in local memory while the standard libraries
go to external memory.
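Roughly, the difference can be sketched in GNU ld script terms. This is a
simplified, hypothetical fragment only -- the actual internal.ldf/fast.ldf
shipped with the Epiphany SDK are more involved, and the region names here
are assumptions:

```
/* Hypothetical sketch, NOT the real SDK scripts.
 * fast.ldf-style placement: user code and stack in core-local SRAM,
 * standard library code mapped to external memory.
 * internal.ldf would map the library sections to INTERNAL_RAM too. */
SECTIONS
{
    .text  : { *(.text) }          > INTERNAL_RAM  /* user code on-chip */
    .stack : { *(.stack) }         > INTERNAL_RAM  /* stack on-chip */
    .libtext : { *libc.a:(.text) } > EXTERNAL_DRAM /* libs off-chip */
}
```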

> > Speed is 932 c/s (sometimes it's 934 c/s).
> So that's a speedup by more than 10% from 830 c/s that you had before.
> Where does this speedup come from?

It comes from placing the standard libraries in local memory instead of
external SRAM. I only changed that one word in the Makefile to get the
speedup.

> > BF_fmt port was slow because I
> > didn't copy salt in local memory; when I did that, speed was 790 c/s.
> > Then I used internal.ldf and got a speed of 932 c/s.
> Do you think the speedup is from basing your code on JtR's BF_std now,
> as opposed to the size-optimized revision of the code in musl?  I doubt
> that this alone would make a 10% difference, considering that we had
> already unrolled the Blowfish rounds loop (and tried unrolling some
> others as well) in musl's code.  There must be something else.

> Are you able to use internal.ldf on the musl-derived code?  What speed
> are you getting?

Sorry for not being clear: "Speed is 932 c/s (sometimes it's 934 c/s)" is
for the musl-derived code. Both implementations now have the same speed.

> > If I try to interleave two
> > instances of bcrypt, then the code can't fit in local memory.
> You may reduce the amount of unrolling to make it fit.
> > At the moment,
> > interleaving two instances of bcrypt doesn't work, it fails self test on
> > get_hash[0](1).
> How are you testing it?  Does it run from off-chip memory in this test?

The bcrypt code is in local memory, the standard library code is in external
memory. For now I'll leave it that way and just focus on getting it working.


> 2. Various out-of-inner-loop code inefficiencies, such as the memcpy().
> We may deal with some of those by optimizing the code.

This is not a problem anymore. Since the memcpy() code is in local memory
now, it takes 0.007 ms to copy the initial BF_ctx structure.

> So I don't buy it when you say that BF_ROUND and BF_encrypt are already
> optimal.  Surely they look that way at first, and the compiler does a
> pretty good job - it's just that there ought to be ways to do better.
> Usage of IADD is one of those.  It might also be possible to replace
> some left shift and addition (used for address calculation) with IMADD.
> If all additions performed for address calculation are currently
> embedded in load instructions, then bundling them with shifts instead
> might not help... or it might, in some cases: by changing which
> execution units remain available for other use at which times (IMADD is
> on the FPU, whereas shifts are not, so we're freeing up an execution
> unit for other use by using IMADD).  There's more to explore here than
> what you see at first.

I agree now; I was looking at the whole thing from a completely wrong
perspective. I missed the fact that using IADD doesn't make the addition
itself faster than ADD; rather, it leaves the IALU free for other
instructions.


