Date: Tue, 1 May 2012 09:01:10 +0400
From: Solar Designer <>
Subject: Re: Sayantan :Weekly Report #2

Hi Sayantan,

On Tue, May 01, 2012 at 10:06:07AM +0530, SAYANTAN DATTA wrote:
> Accomplishments:
> 1. Implemented a function to find the optimum local work group size in
> opencl-mscash2.

Is this in magnum-jumbo?  I don't see it there.  If it is, where?

> 2. The following are not accomplishments in the strict sense but more of
> an experiment, and the results should be helpful in the future:
>     a. rotate() function caused huge performance drop on 7970 on bull.

This is puzzling.  Perhaps you can try reviewing the generated code (IL
or native) to figure out the cause of the performance drop?  In general,
I think we (as a team) should learn to do that.

> I replaced the rotate function with a macro resulting in nearly 50%
> performance improvement.  With this change the opencl-mscash2 produces
> around 72-73K real/s.
>     b. Other cards (4890, 570) I tested don't have such issues with rotate().
>     c. Vectorization of the code caused a small (2-3%) performance drop on
> 7970 and 570.

OK.  The code currently in magnum-jumbo only achieves 36k c/s on 7970:

user@...l:~/john/magnum-jumbo/src$ ../run/john -te -fo=mscash2-opencl -pla=1
OpenCL platform 1: AMD Accelerated Parallel Processing, 2 device(s).
Using device 0: Tahiti
Raw:    35754 c/s real, 50592 c/s virtual
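
For reference, the kind of change discussed above (replacing a rotate
call with a shift-and-OR macro) might look roughly like the sketch below.
This is plain host-side C rather than the actual opencl-mscash2 kernel
code, and the names rotl32_func and ROTL32 are made up for illustration.
In an OpenCL kernel the function form would correspond to the built-in
rotate(x, n).  Both forms assume a rotate count in the range 1..31.

```c
#include <stdint.h>

/* Function form: stands in for a call to a rotate routine.
 * Hypothetical name; n must be in 1..31 (a shift by 32 is undefined in C). */
static inline uint32_t rotl32_func(uint32_t x, unsigned n)
{
    return (x << n) | (x >> (32 - n));
}

/* Macro form: the same shift-and-OR expression expanded in place.
 * Hypothetical name; the argument x is evaluated twice, so it should
 * not have side effects. */
#define ROTL32(x, n) \
    (((uint32_t)(x) << (n)) | ((uint32_t)(x) >> (32 - (n))))
```

Both forms compute the same value, so any speed difference would have to
come from how the compiler translates each one, which is what looking at
the generated IL or native code should reveal.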

> Priorities:
> 1. More experimentation and optimization.


One task I think you could approach slightly later is trying to
implement and optimize Eksblowfish on GPU.  As discussed before, we
expect it to be slow, but it'd be useful to have some hard data to prove
this - or maybe disprove it (unlikely), and to have some OpenCL code
(and maybe CUDA as well) that we could run on future GPUs easily as they
become available.  Specifically, this may be helpful for design of
future password hashing methods.  Additionally, this OpenCL code may
happen to be readily capable of making use of AVX2's VSIB addressing
with Intel's OpenCL SDK - if so, it may actually be faster (on those
future CPUs) than the existing CPU code for bcrypt, until we implement
proper AVX2 code more directly (perhaps with intrinsics).
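
To make the point about gather addressing concrete, here is a minimal
sketch of the Blowfish round function F in plain C.  It is not taken from
the existing bcrypt code; the struct and function names (bf_sboxes, bf_F)
are hypothetical.  Each call performs four S-box lookups whose indices
depend on the data, and it is this access pattern that a vectorized
implementation would need gather loads for.

```c
#include <stdint.h>

/* Hypothetical container for the four 256-entry Blowfish S-boxes. */
typedef struct {
    uint32_t S[4][256];
} bf_sboxes;

/* Blowfish F: four data-dependent table lookups combined with
 * 32-bit addition (mod 2^32) and XOR. */
static uint32_t bf_F(const bf_sboxes *ctx, uint32_t x)
{
    uint32_t a = ctx->S[0][x >> 24];
    uint32_t b = ctx->S[1][(x >> 16) & 0xff];
    uint32_t c = ctx->S[2][(x >> 8) & 0xff];
    uint32_t d = ctx->S[3][x & 0xff];

    return ((a + b) ^ c) + d;
}
```

In Eksblowfish the S-boxes are also rewritten during the expensive key
setup, so each candidate password needs its own 4 KB of read-write S-box
storage in addition to the lookups shown here.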

I had mentioned this task in GSoC student selection context before, but
it may also be approached outside of that context and with slightly
different goals as above.  In that way, it will actually be useful even
if the implementation is indeed slower than the current CPU code on
current hardware.


