Message-ID: <20120509183207.GB21737@openwall.com>
Date: Wed, 9 May 2012 22:32:07 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: Sayantan: Weekly Report #3

Sayantan -

On Wed, May 09, 2012 at 11:12:36PM +0530, SAYANTAN DATTA wrote:
> What exactly is manual unrolling?

By "manual unrolling", which I don't like, I am referring to explicit
duplication of pieces of code in the source file.  In pbkdf2_kernel.cl
currently in magnum-jumbo, you have such a manual 4x unroll in PBKDF2() -
listing 4 separate SHA-1 implementations one after another right in the
sources.  Well, maybe 2 of them are for HMAC; if so, the unroll is 2x.
Is this right, or am I missing something?

> Do you mean using #define

Yes, you could #define a single SHA-1 implementation, then use that
macro 4 times.  This would need to be compared against the compiler's
function inlining, though.  In C, sometimes one or the other produces
better code; I suspect this uncertainty could apply to OpenCL as well,
although I am not familiar with OpenCL.

(It appears that compilers do register allocation differently in these
two cases.  With macros, we force the compiler to do register allocation
for the entire thing at once.  With inline functions, we give the
compiler a hint/option to do register allocation per-function first,
then inline and adjust.)

> #ifdef/ifndef so that one doesn't have to change the codes manually?

For different target devices and OpenCL platforms, yes.

> What I learnt that single core BD is indeed very slow. Right now I have an
> openmp version of host codes which produce 83-84k real c/s. I will
> commit it tonight or by tommorow morning.

Oh, do you mean that the on-CPU portion of MSCash2, which we're
executing sequentially with the on-GPU code (not in parallel yet), was
consuming a substantial portion of total time?  This is a bit surprising,
since it should correspond to 1/10240 of total work for MSCash2.
If the GPU is 100 times faster than the CPU (realistic for non-optimal,
non-vectorized code on the CPU), that's only 1%.  What speed were you
getting without OpenMP for this portion?

While it's OK to add OpenMP support there, I think normally we'll be
making GPU-enabled JtR builds without OpenMP, so that we're consuming
just one CPU core per GPU and can use the rest of the CPU cores for
other stuff (other concurrent invocations of john).  It is wasteful to
have the entire CPU allocated to this task, where it will mostly just
wait for the GPU anyway.

Alexander