john-dev - Re: Parallella: bcrypt

Message-ID: <CA+EaD-YMVJyfB4EaSP3cbJjn86CJfT6kW2HkDC6fdizg+R2OfA@mail.gmail.com>
Date: Fri, 26 Jul 2013 13:55:00 +0200
From: Katja Malvoni <kmalvoni@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: Parallella: bcrypt

Hi Alexander,

On Thu, Jul 25, 2013 at 5:35 PM, Solar Designer <solar@...nwall.com> wrote:

> Even if not, we can do a 32-bit store and we still save 1 cycle.  Right
> now, we have these two right shifts and two ANDs.  We replace these four
> with one 32-bit store (if we have to, but I think we don't - see above),
> two LDRBs, and two IMULs or IMADDs (but these are free for us, since the
> FPU would otherwise be idle).  So that's 3 (or 2) non-free instructions
> instead of 4.
>

I implemented this for only one of the right shifts (tmp2) and reordered
the instructions so that dual-issue is possible, but it's much slower
(793 c/s) than lsr followed by and (976 c/s). Table 27 on p. 68 of the
Epiphany Architecture Reference says that a Byte Internal Data Load
stalls for 2 cycles independently of the instruction sequence, so it
seems that we have a 2-cycle penalty for every LDRB. Still, these two
cycles don't explain the slowdown I get; it shouldn't be this big.
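
To make the two variants being compared concrete, here is a minimal C
sketch of them: extracting a byte with a shift and a mask (the lsr + and
pair) versus storing the 32-bit word and loading one byte back (the LDRB
idea). The function names are hypothetical and not taken from the bcrypt
code, and the byte-load version assumes a little-endian target, which
Epiphany is. It only shows that both give the same value; it says
nothing about cycle counts.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Byte n (0 = least significant) via shift and mask: the lsr + and pair. */
static uint32_t byte_shift_and(uint32_t x, unsigned n)
{
    return (x >> (8 * n)) & 0xFF;
}

/* Byte n via a 32-bit store followed by a byte load (the LDRB idea).
 * Assumption: little-endian byte order, as on Epiphany. */
static uint32_t byte_store_load(uint32_t x, unsigned n)
{
    unsigned char buf[4];
    memcpy(buf, &x, sizeof(buf));   /* the 32-bit store */
    return buf[n];                  /* the byte load */
}

/* Hypothetical self-check: both methods must agree for every byte. */
static void check_bytes(uint32_t x)
{
    for (unsigned n = 0; n < 4; n++)
        assert(byte_shift_and(x, n) == byte_store_load(x, n));
}
```

Calling check_bytes() on any 32-bit value should pass on a little-endian
machine; on a big-endian one the byte index would have to be mirrored.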

On Thu, Jul 25, 2013 at 5:50 AM, Solar Designer <solar@...nwall.com> wrote:

> On Thu, Jul 25, 2013 at 06:02:52AM +0400, Solar Designer wrote:
> > |         "imadd r44, r44, r46\n" \
>
> With 3 in r46, this simply shifts r44 left by 2 bits, but "off-loading"
> this operation to the FPU (which would otherwise be idle).  Good.
> However, note that we could also make use of the addition: put 4 (not 3)
> in r46 (or rather in a compiler-allocated register, as I pointed out in
> another message) and use something like:
>
>         imadd r44, r20, r46
>
> (but with the compiler-allocated register).
>
> > |         "ldr r44, [r20, +r44]\n" \
>
> This would then become:
>
>         ldr r44, [r44]
>
> ... but I'd expect this version of code to run at the exact same speed
> on current Epiphany, as I think there's no penalty for using the adder
> in address calculation.  The power consumption may be slightly lower,
> though, since we'd be keeping this adder idle during this one cycle.
>

If I'm not mistaken, this would require moving the pointer to S[3] into
r20 before every BF_ROUND. IMADD Rd, Rn, Rm does this: Rd = Rd + Rm*Rn.
So the instruction would look like this:

        imadd r20, r44, r46

And before the next use of this instruction it would be necessary to put
the pointer to S[3] back in r20.
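
A small C model of the IMADD semantics as I read them above
(Rd = Rd + Rm*Rn) may make the register problem easier to see. The
helper names and the base address 0x1000 are made up for illustration;
only the multiply-accumulate behaviour comes from the discussion. It
covers both uses: 3 in r46 to scale the index by 4, and 4 in r46 with
the base as the accumulator.

```c
#include <assert.h>
#include <stdint.h>

/* Model of IMADD Rd, Rn, Rm as described above: Rd = Rd + Rn * Rm. */
static void imadd(uint32_t *rd, uint32_t rn, uint32_t rm)
{
    *rd += rn * rm;
}

/* imadd r44, r44, r46 with r46 = 3: r44 becomes r44 + 3*r44 = 4*r44. */
static uint32_t scale_index(uint32_t r44)
{
    imadd(&r44, r44, 3);
    return r44;
}

/* imadd r20, r44, r46 with r46 = 4: r20 becomes base + 4*index,
 * so the base that was in r20 is overwritten. */
static uint32_t element_address(uint32_t *r20, uint32_t r44)
{
    imadd(r20, r44, 4);
    return *r20;
}

/* Hypothetical walk-through with a made-up base address. */
static void demo(void)
{
    uint32_t s3 = 0x1000;   /* stand-in for the address of S[3] */
    uint32_t r20 = s3;

    assert(element_address(&r20, 7) == s3 + 28);
    assert(r20 != s3);      /* the base is gone after the IMADD */
    r20 = s3;               /* reload needed before the next round */
    assert(element_address(&r20, 2) == s3 + 8);
    assert(scale_index(5) == 20);
}
```

Because the destination is also the accumulator, the second call only
gives the right address after r20 has been reloaded, which is the extra
move per BF_ROUND mentioned above.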

Katja
