john-dev - Re: Parallella: bcrypt

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20130725143830.GA14712@openwall.com>
Date: Thu, 25 Jul 2013 18:38:30 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: Parallella: bcrypt

On Thu, Jul 25, 2013 at 01:26:19PM +0200, Katja Malvoni wrote:
> On Thu, Jul 25, 2013 at 4:02 AM, Solar Designer <solar@...nwall.com> wrote:
> 
> > The code itself mostly looks good to me (including your delayed use of
> > results from IMADD and IADD).  Shouldn't you re-order these two, though? -
> >
> > |         "eor %0, %0, r27\n" \
> > |         "eor r23, r22, r23\n" \
> >
> > because r22 is loaded sooner than r27?  Well, maybe this makes no
> > difference on the current chip, but it might if load's latency is
> > increased in a future revision of Epiphany.
> 
> If I reorder them than there is no 4 cycles separation between iadd r23,
> r24, r23 and eor r23, r22, r23 and that's required for dual-issue. In that
> case, speed is 924 c/s.

Oh, I thought it was sufficient to have 3 instructions inbetween, not 4
(so that we'd use the IADD result on the 4th cycle).  Apparently not.

> > Now, here's an issue/bug in the above: you rely on registers being
> > preserved across multiple pieces of inline asm, but gcc does not
> > guarantee you that.  Also, you don't declare which registers you
> > clobber.  To fix this, your BF_ROUND should not be the entire __asm__
> > block, but rather just a portion of the string you put inside such
> > block.  The asm block itself, with proper confession on what registers
> > you clobber, should be in the BF_encrypt function.
> 
> When I did that, e-gcc unnecessary used one more register to store L and
> register being used changed for every BF_ROUND. And than there were 16
> unnecessary mov instructions. So I removed clobbered registers list. I
> added them back now, speed drops from 976 c/s to 970 c/s.

I haven't looked at the new code revision yet, but it sounds like you
misunderstood me.  I was referring to the case of having all 16 rounds
within one asm block, roughly like this:

#define BF_2ROUNDS \
	"asm code string " \
	"corresponding " \
	"to two rounds here"

then in the function:

	__asm__ (
		"some init code"
		BF_2ROUNDS
		BF_2ROUNDS
		BF_2ROUNDS
		BF_2ROUNDS
		BF_2ROUNDS
		BF_2ROUNDS
		BF_2ROUNDS
		BF_2ROUNDS
		"some final code"
		: ...
		: ...
		: ...
	);

So the 2 rounds' strings would all be concatenated into one string by
the compiler.  With this, there's no room for the compiler to get too
smart and generate the extra MOVs you mention.

> On Thu, Jul 25, 2013 at 7:18 AM, Solar Designer <solar@...nwall.com> wrote:
> >  On Thu, Jul 25, 2013 at 06:02:52AM +0400, Solar Designer wrote:
> > > |         "ldr r27, [r45], 0x1\n" \
> >
> > I guess this is read from the P-box.  You should be able to use ldrd
> > here, and thus only have this instruction in every other round (a total
> > of 9 instructions to read the 18 elements).  Don't forget that ldrd
> > needs an even-numbered first register.
> 
> This instruction ensures 4 cycles separation between IADD r23, r24, r23 and
> EOR r23, r22, r23, if I remove it, I'll lose dual-issue in one round. But
> I'll try to reorder instructions so that dual-issue stays.

Oh, you'll avoid this issue when you implement interleaving.  We have
enough registers to pre-load two instances of P (need 36 registers for
this, and still have plenty left for L, R, and the temporaries).

BTW, you may try putting the most frequently used variables into
registers r0 through r7, since instructions with only those low-numbered
registers are smaller (16 bits).

Alexander
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.