john-dev - Re: Sayantan: Weekly Report #3



Message-ID: <20120509183207.GB21737@openwall.com>
Date: Wed, 9 May 2012 22:32:07 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: Sayantan: Weekly Report #3

Sayantan -

On Wed, May 09, 2012 at 11:12:36PM +0530, SAYANTAN DATTA wrote:
> What exactly is manual unrolling?

By "manual unrolling", which I don't like, I am referring to explicit
duplication of pieces of code in the source file.  In pbkdf2_kernel.cl
currently in magnum-jumbo, you have such a manual 4x unroll in PBKDF2() -
listing 4 separate SHA-1 implementations one after another right in the
sources.  Well, maybe 2 of them are for HMAC, if so the unroll is 2x.
Is this right or am I missing something?

> Do you mean using #define

Yes, you could #define a single SHA-1 implementation, then use that
macro 4 times.  This would need to be compared against the compiler's
function inlining, though.  In C, sometimes one or the other produces
better code; I suspect this uncertainty could apply to OpenCL as well,
although I am not familiar with OpenCL.

(It appears that compilers do register allocation differently in these
two cases.  With macros, we force the compiler to do register allocation
for the entire thing at once.  With inline functions, we give the
compiler a hint/option to do register allocation per-function first,
then inline and adjust.)
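
To make the two options concrete, here is a minimal sketch in plain C
(not OpenCL, and not taken from pbkdf2_kernel.cl).  It uses a single
SHA-1 step of the kind found in rounds 0 to 19 rather than a whole
SHA-1; the names MIX_STEP_MACRO, mix_step_inline, four_steps_macro and
four_steps_inline are made up for illustration.  Both variants compute
the same thing; which one a given compiler turns into better code would
still have to be measured.

```c
#include <stdint.h>

/* 32-bit rotate left; n must be in 1..31 here. */
#define ROTL32(x, n) (((x) << (n)) | ((x) >> (32 - (n))))

/* Option 1: macro. The body is pasted at every use, so the compiler
 * allocates registers for the whole expanded function at once. */
#define MIX_STEP_MACRO(a, b, c, d, e, w) \
    do { \
        (e) += ROTL32((a), 5) + ((d) ^ ((b) & ((c) ^ (d)))) \
             + 0x5A827999u + (w); \
        (b) = ROTL32((b), 30); \
    } while (0)

/* Option 2: inline function. The compiler may handle the function on
 * its own first, then inline it and adjust. */
static inline void mix_step_inline(uint32_t a, uint32_t *b, uint32_t c,
                                   uint32_t d, uint32_t *e, uint32_t w)
{
    *e += ROTL32(a, 5) + (d ^ (*b & (c ^ d))) + 0x5A827999u + w;
    *b = ROTL32(*b, 30);
}

/* Four steps written once each via the macro (illustrative only). */
uint32_t four_steps_macro(uint32_t a, uint32_t b, uint32_t c,
                          uint32_t d, uint32_t e, const uint32_t w[4])
{
    MIX_STEP_MACRO(a, b, c, d, e, w[0]);
    MIX_STEP_MACRO(e, a, b, c, d, w[1]);
    MIX_STEP_MACRO(d, e, a, b, c, w[2]);
    MIX_STEP_MACRO(c, d, e, a, b, w[3]);
    return a ^ b ^ c ^ d ^ e;
}

/* The same four steps via the inline function. */
uint32_t four_steps_inline(uint32_t a, uint32_t b, uint32_t c,
                           uint32_t d, uint32_t e, const uint32_t w[4])
{
    mix_step_inline(a, &b, c, d, &e, w[0]);
    mix_step_inline(e, &a, b, c, &d, w[1]);
    mix_step_inline(d, &e, a, b, &c, w[2]);
    mix_step_inline(c, &d, e, a, &b, w[3]);
    return a ^ b ^ c ^ d ^ e;
}
```

Either way the step is written once in the source, which is the point:
no hand-duplicated copies to keep in sync.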

> #ifdef/ifndef so that one doesn't have to change the codes manually?

For different target devices and OpenCL platforms, yes.
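
As a rough sketch of what such per-target selection could look like
(plain C for illustration; TARGET_GPU, TARGET_CPU, UNROLL_FACTOR and
xor_words are hypothetical names, and the factors are placeholders, not
tuned values):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical build-time switches, e.g. passed as -DTARGET_GPU.
 * The factors below are placeholders to be tuned per device. */
#if defined(TARGET_GPU)
#define UNROLL_FACTOR 4
#elif defined(TARGET_CPU)
#define UNROLL_FACTOR 1
#else
#define UNROLL_FACTOR 2
#endif

/* XOR all words together; the main loop's width is picked at compile
 * time, and a scalar tail loop handles whatever is left over. */
uint32_t xor_words(const uint32_t *w, size_t n)
{
    uint32_t acc = 0;
    size_t i = 0;

#if UNROLL_FACTOR >= 4
    for (; i + 4 <= n; i += 4)
        acc ^= w[i] ^ w[i + 1] ^ w[i + 2] ^ w[i + 3];
#elif UNROLL_FACTOR >= 2
    for (; i + 2 <= n; i += 2)
        acc ^= w[i] ^ w[i + 1];
#endif
    for (; i < n; i++)
        acc ^= w[i];
    return acc;
}
```

The result is the same for every setting; only the generated code
differs, so the choice can be made per device without touching the
source.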

> What I learnt is that single core BD is indeed very slow. Right now I have an
> openmp version of the host code which produces 83-84k real c/s.    I will
> commit it tonight or by tomorrow morning.

Oh, do you mean that the on-CPU portion of MSCash2, which we're
executing sequentially with the on-GPU code (not in parallel yet), was
consuming a substantial portion of total time?  This is a bit surprising
since it should correspond to 1/10240 of total work for MSCash2.
If the GPU is 100 times faster than the CPU (realistic for non-optimal
non-vectorized code on the CPU), that's only 1%.
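
The estimate above can be checked with a few lines of C.  This is only
a back-of-the-envelope model assuming strictly sequential CPU and GPU
phases; cpu_time_fraction is a made-up helper name.

```c
/* Fraction of total run time spent in the CPU phase when the CPU and
 * GPU phases run one after the other.
 *   cpu_work_share: part of the total work done on the CPU (0..1)
 *   gpu_speedup:    how many times faster the GPU is per unit of work */
double cpu_time_fraction(double cpu_work_share, double gpu_speedup)
{
    /* Times are in units of "GPU time for all of the work". */
    double cpu_time = cpu_work_share * gpu_speedup;
    double gpu_time = 1.0 - cpu_work_share;

    return cpu_time / (cpu_time + gpu_time);
}
```

With cpu_work_share = 1.0 / 10240 and gpu_speedup = 100.0 this comes
out slightly under 0.01, i.e. about 1% of total time.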

What speed were you getting without OpenMP for this portion?

While it's OK to add OpenMP support there, I think normally we'll be
making GPU-enabled JtR builds without OpenMP, so that we're consuming
just one CPU core per GPU and can use the rest of the CPU cores for
other stuff (other concurrent invocations of john).  It is wasteful to
have the entire CPU allocated to this task, where it will mostly just
wait for GPU anyway.
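
For reference, a minimal sketch of host code that works both with and
without OpenMP, so that the choice stays a build-time one.  The
function names are hypothetical and host_precompute is just a stand-in
for the real per-candidate work; _OPENMP is the macro that compilers
define when OpenMP is enabled.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-candidate on-CPU work. */
static uint32_t host_precompute(uint32_t x)
{
    return x * 2654435761u + 1u;
}

/* Process all candidates. Built with OpenMP enabled, the loop is split
 * across threads; built without it, the pragma is compiled out and the
 * loop runs on a single core. */
void precompute_all(const uint32_t *in, uint32_t *out, size_t count)
{
    long i;

#ifdef _OPENMP
#pragma omp parallel for
#endif
    for (i = 0; i < (long)count; i++)
        out[i] = host_precompute(in[i]);
}
```

A build without OpenMP then uses one CPU core per GPU, leaving the
other cores free for concurrent invocations of john.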

Alexander
