Date: Fri, 26 Jul 2013 13:55:00 +0200
From: Katja Malvoni <kmalvoni@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: Parallella: bcrypt

Hi Alexander,

On Thu, Jul 25, 2013 at 5:35 PM, Solar Designer <solar@...nwall.com> wrote:

> Even if not, we can do a 32-bit store and we still save 1 cycle.  Right
> now, we have these two right shifts and two ANDs.  We replace these four
> with one 32-bit store (if we have to, but I think we don't - see above),
> two LDRBs, and two IMULs or IMADDs (but these are free for us, since the
> FPU would otherwise be idle).  So that's 3 (or 2) non-free instructions
> instead of 4.
>

I implemented this with only one right shift (tmp2) and reordered the
instructions so that dual-issue is possible, but it's much slower (793 c/s)
than LSR followed by AND (976 c/s).  Table 27 on p. 68 of the Epiphany
Architecture Reference says that a Byte Internal Data Load stalls for
2 cycles regardless of the instruction sequence, so it seems that we have
a 2-cycle penalty for every LDRB.  These two cycles alone don't explain the
slowdown I get, though; it shouldn't be this big.

On Thu, Jul 25, 2013 at 5:50 AM, Solar Designer <solar@...nwall.com> wrote:

> On Thu, Jul 25, 2013 at 06:02:52AM +0400, Solar Designer wrote:
> > | "imadd r44, r44, r46\n" \
>
> With 3 in r46, this simply shifts r44 left by 2 bits, but "off-loading"
> this operation to the FPU (which would otherwise be idle).  Good.
> However, note that we could also make use of the addition: put 4 (not 3)
> in r46 (or rather in a compiler-allocated register, as I pointed out in
> another message) and use something like:
>
>	imadd r44, r20, r46
>
> (but with the compiler-allocated register).
>
> > | "ldr r44, [r20, +r44]\n" \
>
> This would then become:
>
>	ldr r44, [r44]
>
> ... but I'd expect this version of code to run at the exact same speed
> on current Epiphany, as I think there's no penalty for using the adder
> in address calculation.  The power consumption may be slightly lower,
> though, since we'd be keeping this adder idle during this one cycle.
>

If I'm not mistaken, this would require moving the pointer to S into r20
before every BF_ROUND.  IMADD Rd, Rn, Rm does this: Rd = Rd + Rm*Rn.  So the
instruction would look like this:

	imadd r20, r44, r46

And before the next use of this instruction, it would be necessary to put S
back in r20.

Katja
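
For reference, the IMADD arithmetic discussed above can be modelled in plain C. This is only a sketch under the semantics stated in this message (Rd = Rd + Rm*Rn, 32-bit registers); the function names and any sample values passed in are hypothetical and are not taken from the actual bcrypt kernel.

```c
#include <stdint.h>

/* Model of "IMADD Rd, Rn, Rm" as described above: Rd = Rd + Rn * Rm.
 * Registers are 32-bit, so the arithmetic wraps modulo 2^32. */
static inline uint32_t imadd(uint32_t rd, uint32_t rn, uint32_t rm)
{
    return rd + rn * rm;
}

/* "imadd r44, r44, r46" with 3 in r46:
 * r44 + r44 * 3 = 4 * r44, i.e. r44 shifted left by 2 bits. */
static inline uint32_t scale_index(uint32_t r44)
{
    return imadd(r44, r44, 3);
}

/* "imadd r20, r44, r46" with 4 in r46:
 * r20 + r44 * 4, i.e. base of S plus the byte offset of the entry.
 * The result is written to r20, so the pointer to S held there is lost
 * and has to be reloaded before the next use. */
static inline uint32_t index_plus_base(uint32_t r20_base, uint32_t r44_index)
{
    return imadd(r20_base, r44_index, 4);
}

/* "imadd r44, r20, r46" with 4 in r46, operands as first proposed:
 * r44 + r20 * 4, which scales the base instead of the index. */
static inline uint32_t as_first_proposed(uint32_t r44_index, uint32_t r20_base)
{
    return imadd(r44_index, r20_base, 4);
}
```

Comparing index_plus_base() with as_first_proposed() for the same base and index shows why the accumulator has to be the register holding S, which is what forces the reload of r20 before every BF_ROUND.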