Date: Thu, 25 Jul 2013 07:50:32 +0400 From: Solar Designer <solar@...nwall.com> To: john-dev@...ts.openwall.com Subject: Re: Parallella: bcrypt Katja - On Thu, Jul 25, 2013 at 06:02:52AM +0400, Solar Designer wrote: > | "imadd r44, r44, r46\n" \ With 3 in r46, this simply shifts r44 left by 2 bits, but "off-loading" this operation to the FPU (which would otherwise be idle). Good. However, note that we could also make use of the addition: put 4 (not 3) in r46 (or rather in a compiler-allocated register, as I pointed out in another message) and use something like: imadd r44, r20, r46 (but with the compiler-allocated register). > | "ldr r44, [r20, +r44]\n" \ This would then become: ldr r44, [r44] ... but I'd expect this version of code to run at the exact same speed on current Epiphany, as I think there's no penalty for using the adder in address calculation. The power consumption may be slightly lower, though, since we'd be keeping this adder idle during this one cycle. Now, how about trying to come up with a clever/crazy sequence of IMADD, IMSUB, IMUL, IADD, or/and ISUB to replace those right shifts? It is OK to replace one LSR with multiple of these instructions because we're otherwise mostly keeping the FPU idle. However, we'll need to implement interleaving in order to hide the increased latencies. Can the right shift by 22: y = x >> 22; be replaced with: y = (2 << 22) - x * 0xffbfffff; where 0xffbfffff is the multiplicative inverse for ((2 << 22) - 1). Would this work well enough for our purpose (where we only use 8 bits of the result, so don't need precision beyond that)? Note that, if needed, we can apply the AND mask before rather than after the simulated right shift - we have this flexibility. http://www.hackersdelight.org/magic.htm Another idea: rather than do: tmp4 = L >> 22; \ tmp4 &= 0x3FC; \ we can shift by 24 bits, which eliminates the need for a mask (we shift the 2 lower bits out, and the bits being shifted in from the left are all zeroes), and we can shift back by 2 bits using IMUL or IMADD (which we get for free, unlike the AND that we're saving). I think this is even better than the previous idea, and maybe you can still use the "magic" approach for the right shift by 14 and/or by 6 (where we can't avoid the AND)? Of course, all of this assumes plenty of parallellism, which we'll need to obtain by interleaving multiple bcrypt's. Alexander
Powered by blists - more mailing lists
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.