john-dev - Re: Parallella: bcrypt

Message-ID: <CA+EaD-Zi8NAYV6-1J0_rzAuSgw2yYJpiBbGUVDfzaPy=j84pvw@mail.gmail.com>
Date: Thu, 25 Jul 2013 13:26:19 +0200
From: Katja Malvoni <kmalvoni@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: Parallella: bcrypt

Hi Alexander,

On Thu, Jul 25, 2013 at 3:36 AM, Solar Designer <solar@...nwall.com> wrote:

> Hi Katja,
>
> On Wed, Jul 24, 2013 at 11:42:53PM +0200, Katja Malvoni wrote:
> > I made use of dual-issue, the speed I'm getting is 976 c/s when compiling
> > Epiphany code with -O2. If I compile with -O3 I get 979 c/s.
>
> This is nice.  This is for 1 instance of bcrypt per core per invocation,
> right?  I mean that there's no interleaving yet.
>

That's right, only one instance.


> Can you try interleaving two instances, perhaps with C code initially?
>

Ok, I will.
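
A minimal C sketch of what interleaving two instances could look like,
not the code from the repository. The names bf_ctx, bf_f, bf_encrypt1,
bf_encrypt2, bf_fill and bf_interleave_matches are hypothetical, and the
P-box/S-box contents are arbitrary filler rather than the real Blowfish
constants. The two instances share no data, so their rounds can be
issued back to back and the scheduler is free to overlap them:

```c
#include <stdint.h>

/* Hypothetical per-instance state: 18 P-box words, four 256-entry S-boxes. */
typedef struct {
    uint32_t P[18];
    uint32_t S[4][256];
} bf_ctx;

/* Blowfish F function: ((S0[a] + S1[b]) ^ S2[c]) + S3[d]. */
static uint32_t bf_f(const bf_ctx *c, uint32_t x)
{
    return ((c->S[0][x >> 24] + c->S[1][(x >> 16) & 0xff])
            ^ c->S[2][(x >> 8) & 0xff]) + c->S[3][x & 0xff];
}

/* Single instance: 16 Feistel rounds, then the final swap and P[17]. */
static void bf_encrypt1(const bf_ctx *c, uint32_t *L, uint32_t *R)
{
    uint32_t l = *L ^ c->P[0], r = *R;
    for (int i = 0; i < 16; i += 2) {
        r ^= bf_f(c, l) ^ c->P[i + 1];
        l ^= bf_f(c, r) ^ c->P[i + 2];
    }
    *L = r ^ c->P[17];
    *R = l;
}

/* Two independent instances with their rounds interleaved. */
static void bf_encrypt2(const bf_ctx *c0, const bf_ctx *c1,
                        uint32_t *L0, uint32_t *R0,
                        uint32_t *L1, uint32_t *R1)
{
    uint32_t l0 = *L0 ^ c0->P[0], r0 = *R0;
    uint32_t l1 = *L1 ^ c1->P[0], r1 = *R1;
    for (int i = 0; i < 16; i += 2) {
        r0 ^= bf_f(c0, l0) ^ c0->P[i + 1];
        r1 ^= bf_f(c1, l1) ^ c1->P[i + 1];
        l0 ^= bf_f(c0, r0) ^ c0->P[i + 2];
        l1 ^= bf_f(c1, r1) ^ c1->P[i + 2];
    }
    *L0 = r0 ^ c0->P[17];  *R0 = l0;
    *L1 = r1 ^ c1->P[17];  *R1 = l1;
}

/* Fill a context with arbitrary values from a simple LCG (test data only). */
static void bf_fill(bf_ctx *c, uint32_t seed)
{
    for (int i = 0; i < 18; i++) {
        seed = seed * 1664525u + 1013904223u;
        c->P[i] = seed;
    }
    for (int b = 0; b < 4; b++)
        for (int i = 0; i < 256; i++) {
            seed = seed * 1664525u + 1013904223u;
            c->S[b][i] = seed;
        }
}

/* Returns 1 if the interleaved version matches two sequential calls. */
int bf_interleave_matches(void)
{
    static bf_ctx a, b;
    uint32_t la = 1, ra = 2, lb = 3, rb = 4;  /* sequential inputs */
    uint32_t xa = 1, ya = 2, xb = 3, yb = 4;  /* interleaved inputs */
    bf_fill(&a, 1);
    bf_fill(&b, 2);
    bf_encrypt1(&a, &la, &ra);
    bf_encrypt1(&b, &lb, &rb);
    bf_encrypt2(&a, &b, &xa, &ya, &xb, &yb);
    return la == xa && ra == ya && lb == xb && rb == yb;
}
```

Calling bf_interleave_matches() checks that interleaving only changes
the instruction order, not the results.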

>
> > Code is in https://github.com/kmalvoni/JohnTheRipper/tree/master
>
> I took a look, and surprisingly (besides the pieces of inline asm) I
> noticed something unrelated: you seem to have inconsistent BF_binary
> sizes between Epiphany and host sides.  I thought you had addressed that
> already?  Maybe you forgot to commit?  Also, your host side code only
> checks 32 bits of the computed hash value, whereas you could check 64
> bits just as easily (so you should).
>

I had problems with my local GitHub repo and wasn't able to commit, so I
edited the files on GitHub online. That was a very bad idea... I forgot to
update the host code and the Makefile. I won't do this again, and I
apologize for the inconvenience.
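
A small sketch of the two points above, under stated assumptions: a
single BF_binary definition shared by both sides (taken here to be 6
words, one of the two sizes mentioned in this thread) and a comparison
of the first 64 bits instead of 32. The names BF_BINARY_WORDS,
cmp_first32 and cmp_first64 are hypothetical and not from the
repository:

```c
#include <stdint.h>

typedef uint32_t BF_word;

/* Assumption: one shared definition used by both host and Epiphany code,
   so the two sides cannot disagree on the size. */
#define BF_BINARY_WORDS 6
typedef BF_word BF_binary[BF_BINARY_WORDS];

/* Fails at compile time if the type ever drifts from the agreed size. */
_Static_assert(sizeof(BF_binary) == BF_BINARY_WORDS * sizeof(BF_word),
               "BF_binary size mismatch");

/* Checks only the first 32 bits of the hash. */
int cmp_first32(const BF_word *computed, const BF_word *stored)
{
    return computed[0] == stored[0];
}

/* Checks the first 64 bits: one extra word compare, far fewer false hits. */
int cmp_first64(const BF_word *computed, const BF_word *stored)
{
    return computed[0] == stored[0] && computed[1] == stored[1];
}
```

Two hashes that agree in the first word but differ in the second pass
cmp_first32 and are rejected by cmp_first64.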

On Thu, Jul 25, 2013 at 4:28 AM, Solar Designer <solar@...nwall.com> wrote:

> I checked out, built, and tried to test this version of code.  The first
> hurdle was the 2 vs. 6 size BF_binary discrepancy.  Because of it, the
> program would just get stuck all the time.  Once I fixed it in my copy
> of parallella_bf_fmt.c, I am getting:
>
> solar@...aro-ubuntu-desktop:~/
>
> 2/JohnTheRipper/run$ ./parallella_john.sh -te -form=bcrypt-parallella
> Benchmarking: bcrypt-parallella, OpenBSD Blowfish ("$2a$05", 32
> iterations) [Parallella]... DONE
> Raw:    865 c/s real, 865 c/s virtual
>
> ... which is much less than what you said it would be.
>
> So perhaps you forgot to commit multiple changes?


This is because the Makefile used fast.ldf instead of internal.ldf.
Everything should work now.

On Thu, Jul 25, 2013 at 4:02 AM, Solar Designer <solar@...nwall.com> wrote:

> The code itself mostly looks good to me (including your delayed use of
> results from IMADD and IADD).  Shouldn't you re-order these two, though? -
>
> |         "eor %0, %0, r27\n" \
> |         "eor r23, r22, r23\n" \
>
> because r22 is loaded sooner than r27?  Well, maybe this makes no
> difference on the current chip, but it might if load's latency is
> increased in a future revision of Epiphany.
>

If I reorder them, then there is no 4-cycle separation between iadd r23,
r24, r23 and eor r23, r22, r23, and that separation is required for
dual-issue. In that case, the speed is 924 c/s.


> Now, here's an issue/bug in the above: you rely on registers being
> preserved across multiple pieces of inline asm, but gcc does not
> guarantee you that.  Also, you don't declare which registers you
> clobber.  To fix this, your BF_ROUND should not be the entire __asm__
> block, but rather just a portion of the string you put inside such
> block.  The asm block itself, with proper confession on what registers
> you clobber, should be in the BF_encrypt function.
>

When I did that, e-gcc unnecessarily used one more register to store L, and
the register being used changed for every BF_ROUND, so there were 16
unnecessary mov instructions. That is why I removed the clobbered registers
list. I have added it back now; the speed drops from 976 c/s to 970 c/s.
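
A toy sketch of the structure suggested above, not the real BF_ROUND:
the macro holds only instruction text, and a single __asm__ block in the
calling function declares its operands and clobbers. The two
instructions mirror lines quoted in this thread. The names BF_ROUND_ASM
and xor_two_pbox_words are hypothetical, the __epiphany__ macro is
assumed to be predefined by e-gcc, and the Epiphany branch is untested
here; the plain C branch computes the same result on other targets:

```c
#include <stdint.h>

/* Instruction text only, no __asm__ keyword. Each copy loads the next
   P-box word into r27 (post-incrementing the pointer in %1) and XORs it
   into %0. */
#define BF_ROUND_ASM \
    "ldr r27, [%1], 0x1\n\t" \
    "eor %0, %0, r27\n\t"

uint32_t xor_two_pbox_words(uint32_t L, const uint32_t *P)
{
#ifdef __epiphany__
    /* One asm block for all rounds, so gcc cannot reallocate registers
       between them; r27 and memory are declared as clobbered. */
    __asm__ volatile (
        BF_ROUND_ASM
        BF_ROUND_ASM
        : "+r" (L), "+r" (P)
        :
        : "r27", "memory"
    );
#else
    /* Portable equivalent for non-Epiphany builds. */
    L ^= P[0];
    L ^= P[1];
#endif
    return L;
}
```

Because L and P are "+r" operands, the compiler picks their registers
once for the whole block instead of once per round.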

On Thu, Jul 25, 2013 at 7:18 AM, Solar Designer <solar@...nwall.com> wrote:

>  On Thu, Jul 25, 2013 at 06:02:52AM +0400, Solar Designer wrote:
> > |         "ldr r27, [r45], 0x1\n" \
>
> I guess this is read from the P-box.  You should be able to use ldrd
> here, and thus only have this instruction in every other round (a total
> of 9 instructions to read the 18 elements).  Don't forget that ldrd
> needs an even-numbered first register.
>

This instruction ensures the 4-cycle separation between IADD r23, r24, r23
and EOR r23, r22, r23. If I remove it, I'll lose dual-issue in one round.
But I'll try to reorder the instructions so that dual-issue stays.
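
A plain C illustration of the ldrd arithmetic only (18 P-box words read
with 9 double-word loads); it does not model the register pairing or
the instruction scheduling. The name xor_pbox_pairs is hypothetical,
and memcpy stands in for the 8-byte load:

```c
#include <stdint.h>
#include <string.h>

/* XORs all 18 P-box words into one value, reading them two at a time.
   *loads receives the number of 8-byte reads performed. */
uint32_t xor_pbox_pairs(const uint32_t P[18], int *loads)
{
    uint32_t acc = 0;
    *loads = 0;
    for (int i = 0; i < 18; i += 2) {
        uint64_t pair;
        memcpy(&pair, &P[i], sizeof pair);  /* one double-word read */
        (*loads)++;
        /* XOR of both halves does not depend on byte order. */
        acc ^= (uint32_t)pair ^ (uint32_t)(pair >> 32);
    }
    return acc;
}
```

The loop runs 9 times, once per pair, which matches one load in every
other round.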

Katja
