john-dev - Re: Parallella: bcrypt

Message-ID: <20130731020647.GA22858@openwall.com>
Date: Wed, 31 Jul 2013 06:06:47 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: Parallella: bcrypt

Katja,

On Tue, Jul 30, 2013 at 10:39:14PM +0200, Katja Malvoni wrote:
> On Tue, Jul 30, 2013 at 4:19 PM, Solar Designer <solar@...nwall.com> wrote:
> > Here's another idea: replace the AND, not the right shift.  You can
> > replace one AND with two IMULs - e.g., to extract the byte at bit offset
> > 16, you can IMUL by 0x100, then right shift by 24, then IMUL by 4 (to
> > get the 8 data bits into bit offsets 2 to 9 as we need for a load).  Can
> > you have both IMULs for free with 2x interleave, or would you have to go
> > for 3x?  In the latter case, you wouldn't be able to preload one of
> > three P arrays, which would defeat the purpose of this new trick for one
> > of two byte extracts - but we'd nevertheless potentially save a cycle on
> > the other byte extract.
> 
> At first look, I think I can't have both for free. Mostly because result is
> used with other FPU instructions so gaps become too big.
> In case of 3x interleave I will be able to fully preload only one P array.
> I'll need more tmp registers (in worst case 4), 2 registers for L2 and R2,
> pointer to ctx3, 5 pointers to P and S arrays and one ptr for 3rd instance.

OK, let's stay with 2x interleave for now.  That said, with 3x
interleave you could have compensated for not preloading two of the
three P arrays by using LDRD to load two P elements at once within the
loop, and the savings from fewer ANDs are probably greater than that
cost: each removed AND saves 1 cycle per Blowfish round per instance,
whereas each added LDRD costs only 0.5 cycles per Blowfish round per
instance.

Perhaps you can try implementing the trick for the byte at bit offsets
16 to 23, where you can reuse the existing 0xff constant with IMADD?
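
For reference, here is a minimal C sketch of the arithmetic behind the
trick, not the actual Epiphany assembly.  It assumes 32-bit unsigned
wraparound for the multiplies and assumes IMADD behaves as a
multiply-accumulate (acc += a * b), so that x + x * 0xff gives the same
result as x * 0x100.  The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reference version: shift, AND with 0xff, then scale by 4 so that the
 * byte lands in bit offsets 2 to 9 (a byte offset into 32-bit words). */
uint32_t offset_and(uint32_t x)
{
    return ((x >> 16) & 0xffu) << 2;
}

/* AND-free version: multiply by 0x100 (the top byte falls off the 32-bit
 * result), shift right by 24, multiply by 4. */
uint32_t offset_imul(uint32_t x)
{
    uint32_t t = x * 0x100u; /* bits 16..23 move to bits 24..31 */
    t >>= 24;                /* only the wanted byte remains */
    return t * 4u;           /* byte now at bit offsets 2 to 9 */
}

/* Variant reusing a 0xff constant, assuming multiply-accumulate:
 * x + x * 0xff == x * 0x100 (modulo 2^32). */
uint32_t offset_imadd(uint32_t x)
{
    uint32_t t = x;          /* accumulator starts as x */
    t += x * 0xffu;          /* stand-in for IMADD */
    t >>= 24;
    return t * 4u;
}

/* Spot-check that all three versions agree on a few inputs. */
void check_byte_extract(void)
{
    const uint32_t samples[] = { 0u, 0x12345678u, 0x00ff0000u, 0xffffffffu };
    for (size_t i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
        assert(offset_imul(samples[i]) == offset_and(samples[i]));
        assert(offset_imadd(samples[i]) == offset_and(samples[i]));
    }
}
```

Whether the multiplies are actually free depends on how they interleave
with the other instructions, which this sketch does not model.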

> I have one question - when transferring only one struct, I'm not relying on
> order in which e_write() calls are executed (I'm assuming that data is
> written to SDRAM sequentially and since my code doesn't work if start array
> is not last variable in struct, I believe my assumption is correct). I
> still haven't tested it enough times to conclude that occasional stall
> doesn't happen but so far I haven't seen it. With two structs I'd rely on
> order or I'd have to check that data is transferred before transferring
> start. I think that making sure data is transferred is less costly than
> unnecessary transfers, but I'd like your opinion on this before I start
> (normally I'd just start and see what happens but my first approach was to
> use struct with keys and salt inside the "data" struct but it didn't work
> and I need to debug it (I checked offsets, equal on both sides) and I want
> to be sure I won't waste time).

I think it's good to stay with only one struct to transfer, but adjust
the transfer size to be as needed on a given occasion.  This does mean
that the salt will sometimes be transferred unnecessarily (whenever the
keys change, the salt will also have to be transferred), but that's OK.
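
A rough sketch of what I mean, under assumptions: the struct layout, the
field sizes, and the function names below are hypothetical, and memcpy()
stands in for the real host-to-device write.  The idea is to order the
fields as keys, salt, start (with start last, as you have it now) and to
always write one contiguous range ending at the end of the struct, only
varying where that range begins.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_KEYS   2   /* hypothetical: candidate keys per transfer */
#define KEY_WORDS  18  /* hypothetical: 32-bit words per key */
#define SALT_WORDS 4   /* hypothetical: 32-bit words of salt */

typedef struct {
    uint32_t keys[MAX_KEYS][KEY_WORDS]; /* changes least often */
    uint32_t salt[SALT_WORDS];          /* changes more often */
    uint32_t start[MAX_KEYS];           /* must remain the last member */
} shared_data;

/* Pick the range to write: the whole struct when the keys changed,
 * otherwise only salt and start.  Returns the length in bytes and
 * stores the starting offset in *offset. */
size_t transfer_range(int keys_changed, size_t *offset)
{
    *offset = keys_changed ? offsetof(shared_data, keys)
                           : offsetof(shared_data, salt);
    return sizeof(shared_data) - *offset;
}

/* One contiguous write per invocation, so start always goes last
 * within the same transfer.  memcpy() is a stand-in for the real write. */
void write_shared(void *dst_base, const shared_data *src, int keys_changed)
{
    size_t off;
    size_t len = transfer_range(keys_changed, &off);
    assert(off + len == sizeof(shared_data));
    memcpy((char *)dst_base + off, (const char *)src + off, len);
}
```

With this layout a salt-only update skips the keys entirely, while a key
update costs one extra salt transfer, which is the trade-off described
above.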

Alexander