Date: Sun, 27 Jan 2013 20:37:12 +0400 From: Solar Designer <solar@...nwall.com> To: john-dev@...ts.openwall.com Subject: Re: Proposed optimizations to pwsafe Brian, Lukas - On Sat, Jan 26, 2013 at 04:24:23PM -0500, Brian Wallace wrote: > For this, I used a search and replace for the code I wrote for the CPU > version, then did some manual optimization(as you can see I did leave room > for improvement). The CPU version I copied from a tool I wrote called > Safe-Cracker, which I did manually unroll the code. I'd prefer that we use per-round #define macros for unrolling (like it's done e.g. in MD5_std.c and sha2.c), rather than directly put the round bodies into the function. So e.g. instead of: w += sigma1( w ) + sigma0( w ); f += Sigma1( c ) + Ch( c, d, e ) + 0x0fc19dc6 + ( (w)); b += f; f += Sigma0( g ) + Maj( g, h, a ); we'd have one line invoking a macro e.g. like this: R(0, g, h, a, b, c, d, e, f, 0x0fc19dc6) and the macro would be defined as: #define R(i, a, b, c, d, e, f, g, h, ac) \ w[i + 2] += sigma1(w[i]) + sigma0(w[i + 3]); \ h += Sigma1(e) + Ch(e, f, g) + ac + w[i + 2]; \ d += h; \ h += Sigma0(a) + Maj(a, b, c); For some of the rounds where additional optimizations are possible, the four lines may be written explicitly (with the optimizations made), but for most rounds we'd use the macro. MD5_std.c uses this approach - I mean using the macros most of the time, but not always. Similarly, the final instance of SHA-256 outside of the "for (i = 0; i < salt->iterations; i++)" loop should either use a giant macro shared with one used inside the loop (so that we don't spell out almost the whole SHA-256 twice) or it may even be unoptimized (no unrolling done to it) or merged back into the iterations loop (then all rounds would be computed) since its speed doesn't matter much unless the iteration count is low (Password Safe 3 does not allow for less than 2048, so this is impossible). While the early reject is theoretically nice, we're talking a speedup of less than 1/2048, or less than 0.05% - and maybe none at all or even a slowdown (as we might be exceeding L1 instruction cache size because of this extra code). Thus, I think it'd be best to simply undo this optimization (include the final iteration in the loop). Overall, we'd make the source file maybe ~6x shorter while achieving the same level of performance. I wrote the above while looking at opencl/pwsafe_kernel.cl, but it probably applies to cuda/pwsafe.cu as well, and partially to the CPU code as well. Thanks, Alexander
Powered by blists - more mailing lists
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.