john-dev - Re: Parallella: bcrypt

Message-ID: <20130721002541.GA8765@openwall.com>
Date: Sun, 21 Jul 2013 04:25:41 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: Parallella: bcrypt

Katja,

On Wed, Jul 17, 2013 at 07:40:17PM +0200, Katja Malvoni wrote:
> On Wed, Jul 17, 2013 at 7:17 PM, Katja Malvoni <kmalvoni@...il.com> wrote:
> > I was surprised by that as well, I measured it again, 2.5 ms. I'm
> > measuring clock ticks with Epiphany timers (assuming 600 MHz), I start
> > timer after declarations in BF_crypt and I stop it before entering do{ ...
> > }while(--count);
> > I was passing pointers to shared buffer when calling BF_crypt(), I tried
> > copying data from shared buffer into local variables, it's even slower -
> > 809 c/s
> 
> It comes from copying initial S box to data structure in BF_crypt() (takes
> around 2.3 ms). If S box is not copied then it must be transferred before
> each BF_crypt call. Initial S box is transferred when loading *.srec file.
> I think this is cheaper than transferring it from host to Epiphany per
> every computed hash.

Yes, I also think that copying within local memory is faster.  However,
we may want to optimize the memcpy().  The existing implementation is
too generic - it doesn't use the 64-bit dual-register load/store
instructions, it has little unrolling, and it include support for sizes
that are not a multiple of 4.  (This is from a quick glance at
"e-objdump -d parallella_e_bcrypt.elf".)  Can you try creating a simpler
specialized implementation instead, which would use the ldrd/strd insns
and greater unrolling (e.g., 32 ldrd/strd pairs times 16 loop iterations,
for a total of 512 x 64-bit data copies, or 4096 bytes total)?  Also,
re-order the instructions so that no store is attempted immediately
after its corresponding load, which hides the load latency - e.g.:

mov r8,16
start:
ldrd r0,[r6]
ldrd r2,[r6,+1]
ldrd r4,[r6,+2]
strd r0,[r7]
strd r2,[r7,+1]
strd r4,[r7,+2]
[... repeat 9 more times using cpp macros, supplying different offsets ...]
ldrd r0,[r6,+30]
ldrd r2,[r6,+31]
strd r0,[r7,+30]
strd r2,[r7,+31]
add r6,r6,0x100
add r7,r7,0x100
sub r8,r8,1
bne start

(totally untested!)
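
To prototype the same idea in C before hand-writing the assembly, a
minimal sketch might look like the one below.  It assumes both buffers
are 8-byte aligned and do not overlap, and that the copy is always
exactly 4096 bytes.  The names copy_sbox(), COPY4 and SBOX_DWORDS are
made up for illustration, and copy_sbox_selfcheck() is only a host-side
sanity check.  Whether the compiler actually emits ldrd/strd for this is
not guaranteed, so the generated code would need checking with
e-objdump.

```c
#include <stdint.h>
#include <string.h>

#define SBOX_DWORDS 512 /* 4096 bytes as 64-bit words (assumed fixed size) */

/* Load four 64-bit words into temporaries first, then store them,
 * so that no store directly follows the load it depends on. */
#define COPY4(o) do { \
	uint64_t a = src[(o)], b = src[(o) + 1]; \
	uint64_t c = src[(o) + 2], d = src[(o) + 3]; \
	dst[(o)] = a; dst[(o) + 1] = b; \
	dst[(o) + 2] = c; dst[(o) + 3] = d; \
} while (0)

/* Hypothetical specialized copy: 16 iterations x 32 64-bit words each.
 * No handling of odd sizes, misalignment, or overlapping buffers. */
static void copy_sbox(uint64_t *restrict dst, const uint64_t *restrict src)
{
	int n;

	for (n = 16; n != 0; n--) {
		COPY4(0);  COPY4(4);  COPY4(8);  COPY4(12);
		COPY4(16); COPY4(20); COPY4(24); COPY4(28);
		src += 32; /* 32 x 8 = 0x100 bytes per iteration */
		dst += 32;
	}
}

/* Host-side sanity check: returns 1 if the copy matches the source. */
static int copy_sbox_selfcheck(void)
{
	static uint64_t src[SBOX_DWORDS], dst[SBOX_DWORDS];
	int i;

	for (i = 0; i < SBOX_DWORDS; i++)
		src[i] = 0x0123456789abcdefULL * (uint64_t)(i + 1);
	copy_sbox(dst, src);
	return memcmp(dst, src, sizeof(src)) == 0;
}
```

The grouping of four here is arbitrary; the asm sketch above uses groups
of three, limited by the registers it happens to pick.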

Alexander