john-dev - Re: Parallella: bcrypt

Message-ID: <CA+EaD-bHroMy+FbVp9tX05hxa3e-ofWNaMD3E9SVyQ6WyhgKdA@mail.gmail.com>
Date: Sun, 21 Jul 2013 19:21:27 +0200
From: Katja Malvoni <kmalvoni@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: Parallella: bcrypt

On Sun, Jul 21, 2013 at 6:01 PM, Solar Designer <solar@...nwall.com> wrote:

> Hi Katja,
>
> On Sun, Jul 21, 2013 at 03:40:30PM +0200, Katja Malvoni wrote:
> > Since Epiphany code became much smaller when I integrated it with JtR, I
> > tried using internal.ldf instead of fast.ldf and code fits in local
> > memory.
>
> How do internal.ldf and fast.ldf differ?
>

In internal.ldf, all code is placed in local memory. In fast.ldf, user code
and the stack are placed in local memory, while the standard libraries are
in external SRAM.


> > Speed is 932 c/s (sometimes it's 934 c/s).
>
> So that's a speedup by more than 10% from 830 c/s that you had before.
> Where does this speedup come from?
>

It comes from placing the standard libraries in local memory instead of
external SRAM. I only changed that one word in the Makefile to get that
speedup.


>
> > BF_fmt port was slow because I
> > didn't copy salt in local memory, when I did that speed was 790 c/s.
> > Then I used internal.ldf and got speed of 932 c/s.
>
> Do you think the speedup is from basing your code on JtR's BF_std now,
> as opposed to the size-optimized revision of the code in musl?  I doubt
> that this alone would make a 10% difference, considering that we had
> already unrolled the Blowfish rounds loop (and tried unrolling some
> others as well) in musl's code.  There must be something else.


> Are you able to use internal.ldf on the musl-derived code?  What speed
> are you getting?
>

Sorry for not being clear: "Speed is 932 c/s (sometimes it's 934 c/s)." is
for the musl-derived code. Both implementations now run at the same speed.


> > If I try to interleave two
> > instances of bcrypt then code can't fit in local memory.
>
> You may reduce the amount of unrolling to make it fit.
>
> > At the moment,
> > interleaving two instances of bcrypt doesn't work, it fails self test on
> > get_hash[0](1).
>
> How are you testing it?  Does it run from off-chip memory in this test?
>

The bcrypt code is in local memory; the standard library code is in external
memory. For now I'll leave it that way and just focus on getting it working.

[...]
>

> 2. Various out-of-inner-loop code inefficiencies, such as the memcpy().
> We may deal with some of those by optimizing the code.
>

This is not a problem any more. Since the memcpy() code is in local memory
now, it takes 0.007 ms to copy the initial BF_ctx structure.
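
For reference, a minimal sketch of the kind of copy being timed. The struct
layout (four 256-entry S-boxes plus an 18-word P-array, 4168 bytes with
32-bit words) is the usual Blowfish state; the helper name bf_load_initial
is hypothetical and this is not the actual Epiphany code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t BF_word;

/* Assumed layout: standard Blowfish state, 4 * 256 + 18 words. */
typedef struct {
    BF_word S[4][0x100];
    BF_word P[18];
} BF_ctx;

/* Hypothetical helper: copy the initial state into a working context.
 * On the Epiphany core both the destination and memcpy() itself would
 * ideally live in local memory. */
static void bf_load_initial(BF_ctx *local, const BF_ctx *initial)
{
    assert(local != NULL && initial != NULL);
    memcpy(local, initial, sizeof(*local));
}
```

The point is only that one such copy moves a little over 4 KB, so where
memcpy() itself runs from matters.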

[...]
>
> So I don't buy it when you say that BF_ROUND and BF_encrypt are already
> optimal.  Surely they look that way at first, and the compiler does a
> pretty good job - it's just that there ought to be ways to do better.
> Usage of IADD is one of those.  It might also be possible to replace
> some left shift and addition (used for address calculation) with IMADD.
> If all additions performed for address calculation are currently
> embedded in load instructions, then bundling them with shifts instead
> might not help... or it might, in some cases: by changing which
> execution units remain available for other use at which times (IMADD is
> on the FPU, whereas shifts are not, so we're freeing up an execution
> unit for other use by using IMADD).  There's more to explore here than
> what you see at first.
>

I agree now; I was looking at the whole thing from a completely wrong
perspective. I missed the point that using IADD doesn't necessarily make the
addition faster than ADD, but it leaves the IALU free for other instructions.
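
To make the discussion concrete, here is a rough sketch of the Blowfish F
function in plain C with the index arithmetic spelled out. The flat
1024-word S-box array and the name bf_f are assumptions for illustration,
not the code from BF_std or musl. The comments only mark which operations
are shifts and which are additions; they say nothing about what the
compiler actually emits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative only: S is assumed to hold the four 256-entry S-boxes
 * back to back (1024 words). */
static uint32_t bf_f(const uint32_t *S, uint32_t x)
{
    assert(S != NULL);

    /* Shifts and masks extract the four bytes of x (integer ALU work). */
    uint32_t a = x >> 24;
    uint32_t b = (x >> 16) & 0xff;
    uint32_t c = (x >> 8) & 0xff;
    uint32_t d = x & 0xff;

    /* Additions that select the S-box: part of address calculation. */
    b += 0x100;
    c += 0x200;
    d += 0x300;

    /* Data-path additions: the kind of ADD that could instead be
     * issued as IADD, leaving the integer ALU free for something else. */
    uint32_t t = S[a] + S[b];
    t ^= S[c];
    return t + S[d];
}
```

Each call has two data-path additions plus several shifts and
address additions, so there is some room to spread the work across units.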

Katja
