Date: Sun, 21 Jul 2013 04:25:41 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: Parallella: bcrypt

Katja,

On Wed, Jul 17, 2013 at 07:40:17PM +0200, Katja Malvoni wrote:
> On Wed, Jul 17, 2013 at 7:17 PM, Katja Malvoni <kmalvoni@...il.com> wrote:
> > I was surprised by that as well, I measured it again, 2.5 ms. I'm
> > measuring clock ticks with Epiphany timers (assuming 600 MHz), I start
> > timer after declarations in BF_crypt and I stop it before entering do{ ...
> > }while(--count);
> > I was passing pointers to shared buffer when calling BF_crypt(), I tried
> > copying data from shared buffer into local variables, it's even slower -
> > 809 c/s
> 
> It comes from copying initial S box to data structure in BF_crypt() (takes
> around 2.3 ms). If S box is not copied then it must be transferred before
> each BF_crypt call. Initial S box is transferred when loading *.srec file.
> I think this is cheaper than transferring it from host to Epiphany per
> every computed hash.

Yes, I also think that copying within local memory is faster.  However,
we may want to optimize the memcpy().  The existing implementation is
too generic - it doesn't use the 64-bit dual-register load/store
instructions, it has little unrolling, and it includes support for sizes
that are not a multiple of 4.  (This is from a quick glance at
"e-objdump -d parallella_e_bcrypt.elf".)  Can you try creating a simpler
specialized implementation instead, which would use the ldrd/strd insns
and greater unrolling (e.g., 32 ldrd/strd pairs times 16 loop iterations,
for a total of 512 x 64-bit data copies, or 4096 bytes total)?  Also,
re-order the instructions such that each store is not attempted
immediately after its corresponding load, to hide the load latency - e.g.:

mov r8,16
start:
ldrd r0,[r6]
ldrd r2,[r6,+1]
ldrd r4,[r6,+2]
strd r0,[r7]
strd r2,[r7,+1]
strd r4,[r7,+2]
[... repeat 9 more times using cpp macros, supplying different offsets ...]
ldrd r0,[r6,+30]
ldrd r2,[r6,+31]
strd r0,[r7,+30]
strd r2,[r7,+31]
add r6,r6,0x100
add r7,r7,0x100
sub r8,r8,1
bne start

(totally untested!)
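
For comparison, the same access pattern can be sketched in portable C.
This is only an illustration under assumptions, not code from the JtR
tree: the name copy_4k and the COPY3/COPY2 macros are hypothetical,
uint64_t stands in for an ldrd/strd register pair, and whether e-gcc
actually emits ldrd/strd and keeps the loads grouped ahead of the stores
would have to be checked with e-objdump.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers: issue all loads of a group before any store,
 * mirroring the load/store ordering in the assembly sketch above. */
#define COPY3(d, s, o) do { \
	uint64_t a = (s)[(o)], b = (s)[(o) + 1], c = (s)[(o) + 2]; \
	(d)[(o)] = a; (d)[(o) + 1] = b; (d)[(o) + 2] = c; \
} while (0)

#define COPY2(d, s, o) do { \
	uint64_t a = (s)[(o)], b = (s)[(o) + 1]; \
	(d)[(o)] = a; (d)[(o) + 1] = b; \
} while (0)

/* Copy exactly 4096 bytes: 16 iterations x 32 64-bit words (0x100 bytes).
 * Both pointers are assumed to be 8-byte aligned and non-overlapping;
 * no support for other sizes, unlike a generic memcpy(). */
void copy_4k(uint64_t *restrict dst, const uint64_t *restrict src)
{
	int n;

	assert(((uintptr_t)dst & 7) == 0 && ((uintptr_t)src & 7) == 0);

	for (n = 16; n != 0; n--) {
		/* 10 groups of 3 words cover word offsets 0..29 */
		COPY3(dst, src, 0);  COPY3(dst, src, 3);
		COPY3(dst, src, 6);  COPY3(dst, src, 9);
		COPY3(dst, src, 12); COPY3(dst, src, 15);
		COPY3(dst, src, 18); COPY3(dst, src, 21);
		COPY3(dst, src, 24); COPY3(dst, src, 27);
		/* final group of 2 words covers offsets 30..31 */
		COPY2(dst, src, 30);
		dst += 32; /* advance by 32 words = 0x100 bytes */
		src += 32;
	}
}
```

A caller would pass the initial S-box image as src and the per-hash
working copy as dst; both names are placeholders here.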

Alexander
