Date: Sun, 27 Jan 2013 19:22:19 -0500 From: Brian Wallace <nightstrike9809@...il.com> To: john-dev@...ts.openwall.com Subject: Re: Proposed optimizations to pwsafe Ok, I'll do those changes. I haven't done much cuda/ocl coding in the past, so it might take me a short while to get up to speed on what works best, although I have a good background in C and hash cracking optimization. What kind of benchmarks are we getting on pwsafe-opencl vs hashcat. On Jan 27, 2013 11:37 AM, "Solar Designer" <solar@...nwall.com> wrote: > Brian, Lukas - > > On Sat, Jan 26, 2013 at 04:24:23PM -0500, Brian Wallace wrote: > > For this, I used a search and replace for the code I wrote for the CPU > > version, then did some manual optimization(as you can see I did leave > room > > for improvement). The CPU version I copied from a tool I wrote called > > Safe-Cracker, which I did manually unroll the code. > > I'd prefer that we use per-round #define macros for unrolling (like it's > done e.g. in MD5_std.c and sha2.c), rather than directly put the round > bodies into the function. > > So e.g. instead of: > > w += sigma1( w ) + sigma0( w ); > f += Sigma1( c ) + Ch( c, d, e ) + 0x0fc19dc6 + ( (w)); > b += f; > f += Sigma0( g ) + Maj( g, h, a ); > > we'd have one line invoking a macro e.g. like this: > > R(0, g, h, a, b, c, d, e, f, 0x0fc19dc6) > > and the macro would be defined as: > > #define R(i, a, b, c, d, e, f, g, h, ac) \ > w[i + 2] += sigma1(w[i]) + sigma0(w[i + 3]); \ > h += Sigma1(e) + Ch(e, f, g) + ac + w[i + 2]; \ > d += h; \ > h += Sigma0(a) + Maj(a, b, c); > > For some of the rounds where additional optimizations are possible, the > four lines may be written explicitly (with the optimizations made), but > for most rounds we'd use the macro. MD5_std.c uses this approach - > I mean using the macros most of the time, but not always. > > Similarly, the final instance of SHA-256 outside of the "for (i = 0; i < > salt->iterations; i++)" loop should either use a giant macro shared with > one used inside the loop (so that we don't spell out almost the whole > SHA-256 twice) or it may even be unoptimized (no unrolling done to it) > or merged back into the iterations loop (then all rounds would be > computed) since its speed doesn't matter much unless the iteration count > is low (Password Safe 3 does not allow for less than 2048, so this is > impossible). While the early reject is theoretically nice, we're > talking a speedup of less than 1/2048, or less than 0.05% - and maybe > none at all or even a slowdown (as we might be exceeding L1 instruction > cache size because of this extra code). Thus, I think it'd be best to > simply undo this optimization (include the final iteration in the loop). > > Overall, we'd make the source file maybe ~6x shorter while achieving the > same level of performance. > > I wrote the above while looking at opencl/pwsafe_kernel.cl, but it > probably applies to cuda/pwsafe.cu as well, and partially to the CPU > code as well. > > Thanks, > > Alexander > Content of type "text/html" skipped
Powered by blists - more mailing lists
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.