Openwall GNU/*/Linux - a small security-enhanced Linux distro for servers
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date: Sun, 27 Jan 2013 19:22:19 -0500
From: Brian Wallace <>
Subject: Re: Proposed optimizations to pwsafe

Ok, I'll do those changes.  I haven't done much cuda/ocl coding in the
past, so it might take me a short while to get up to speed on what works
best, although I have a good background in C and hash cracking
optimization.  What kind of benchmarks are we getting on pwsafe-opencl vs
On Jan 27, 2013 11:37 AM, "Solar Designer" <> wrote:

> Brian, Lukas -
> On Sat, Jan 26, 2013 at 04:24:23PM -0500, Brian Wallace wrote:
> > For this, I used a search and replace for the code I wrote for the CPU
> > version, then did some manual optimization(as you can see I did leave
> room
> > for improvement).  The CPU version I copied from a tool I wrote called
> > Safe-Cracker, which I did manually unroll the code.
> I'd prefer that we use per-round #define macros for unrolling (like it's
> done e.g. in MD5_std.c and sha2.c), rather than directly put the round
> bodies into the function.
> So e.g. instead of:
>         w[2] += sigma1( w[0] ) + sigma0( w[3] );
>         f += Sigma1( c ) + Ch( c, d, e ) + 0x0fc19dc6 + ( (w[2]));
>         b += f;
>         f += Sigma0( g ) + Maj( g, h, a );
> we'd have one line invoking a macro e.g. like this:
>         R(0, g, h, a, b, c, d, e, f, 0x0fc19dc6)
> and the macro would be defined as:
> #define R(i, a, b, c, d, e, f, g, h, ac) \
>         w[i + 2] += sigma1(w[i]) + sigma0(w[i + 3]); \
>         h += Sigma1(e) + Ch(e, f, g) + ac + w[i + 2]; \
>         d += h; \
>         h += Sigma0(a) + Maj(a, b, c);
> For some of the rounds where additional optimizations are possible, the
> four lines may be written explicitly (with the optimizations made), but
> for most rounds we'd use the macro.  MD5_std.c uses this approach -
> I mean using the macros most of the time, but not always.
> Similarly, the final instance of SHA-256 outside of the "for (i = 0; i <
> salt->iterations; i++)" loop should either use a giant macro shared with
> one used inside the loop (so that we don't spell out almost the whole
> SHA-256 twice) or it may even be unoptimized (no unrolling done to it)
> or merged back into the iterations loop (then all rounds would be
> computed) since its speed doesn't matter much unless the iteration count
> is low (Password Safe 3 does not allow for less than 2048, so this is
> impossible).  While the early reject is theoretically nice, we're
> talking a speedup of less than 1/2048, or less than 0.05% - and maybe
> none at all or even a slowdown (as we might be exceeding L1 instruction
> cache size because of this extra code).  Thus, I think it'd be best to
> simply undo this optimization (include the final iteration in the loop).
> Overall, we'd make the source file maybe ~6x shorter while achieving the
> same level of performance.
> I wrote the above while looking at opencl/, but it
> probably applies to cuda/ as well, and partially to the CPU
> code as well.
> Thanks,
> Alexander


Powered by blists - more mailing lists

Your e-mail address:

Powered by Openwall GNU/*/Linux - Powered by OpenVZ