Date: Mon, 10 Sep 2012 13:12:54 +0530
From: Sayantan Datta <>
Subject: Re: bitslice DES on GPU

On Mon, Sep 10, 2012 at 8:58 AM, Sayantan Datta <> wrote:

> On Wed, Sep 5, 2012 at 8:52 AM, Solar Designer <> wrote:
>> Hi Sayantan,
>> On Tue, Sep 04, 2012 at 11:04:45PM +0530, Sayantan Datta wrote:
>> > How do I test the LM hashes for the original CPU implementation? Is it
>> > sufficient to set the LM parameter to 1 in the DES_bs_init() function?
>> You should be doing whatever LM_fmt.c does.  That's probably not the
>> kind of answer you wanted, but I'm afraid I don't have a better one.
>> In short, yes, DES_bs_init() is called with 1 for LM, and proper
>> functions are used after that point.
>> Anyhow, I took a look at your code, and I tested it briefly to see where
>> the bottlenecks are.  Many things are done non-optimally, which I guess
>> you're aware of:
>> 1. DES_EXPAND = 1 is probably not an optimal choice (although it was/is
>> OK to try it as well).
>> 2. The expanded keys (and some other fields) should only exist on the
>> GPU side, yet you have space reserved for them in opencl_DES_bs_combined
>> on the CPU side as well.  This probably does not directly cost any CPU
>> time (unless I missed something), but it may result in less optimal use
>> of caches on the CPU (potentially more cache tag conflicts between
>> fields of opencl_DES_bs_combined that you do actually use).
>> 3. You're copying too much data to/from GPU.  For example, there's no
>> point in copying B to GPU, and there's usually no point in copying K
>> from GPU (although you do need to preserve the partially transposed and
>> expanded key bits across multiple calls for the same salt somehow).
>> There are other minor things to exclude from the copying as well.
>> 4. A lot of time is spent on the CPU side.  Some testing I did suggests
>> that it's around 33% (when run on bull's CPU and HD 7970).  You have
>> some extra copying on the CPU side as well.
>> 5. DES_bs_cmp_all() became slower than the usual CPU implementation
>> because it now uses 32-bit rather than 64-bit vector elements.
>> 6. You're using NVIDIA-friendly S-box expressions.  This results in
>> unneeded overhead when running on AMD GPUs.  Also, you don't have vsel()
>> defined to use bitselect() - you'll need to fix that when you switch to
>> proper S-box expressions for AMD.  You'll need to support both of these,
>> so that we can run reasonable benchmarks on both GPUs.
>> 7. You keep too many of the things in global memory.  With DES_EXPAND = 1,
>> you have to keep the expanded keys in global memory (but at least
>> they're read sequentially).  You don't have to keep B and E in global
>> memory.  BTW, what do you mean by trying to declare only some of the
>> struct fields __global?  I doubt that this is valid - either the entire
>> struct is in global memory or the entire struct is in local memory - no?
>> I'd expect that you'd need to split it into two structs if you have
>> these two kinds of fields.
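
[Illustration for item 6.] A minimal sketch of what "vsel() defined to use bitselect()" could look like, written in plain C so it can be compiled on a CPU. It assumes vsel(dst, a, b, c) means "take each bit from b where c is set, otherwise from a", which matches the semantics of OpenCL's bitselect(a, b, c). The names vtype, bitselect_emul and vsel_gates are hypothetical and are not taken from the patch under discussion.

```c
#include <stdint.h>

/* One bit per candidate: a 32-bit word carries 32 DES instances. */
typedef uint32_t vtype;

/*
 * Portable stand-in for OpenCL's bitselect(a, b, c): for every bit
 * position, the result bit comes from b where c is 1 and from a
 * where c is 0.  In an OpenCL kernel the built-in would be used.
 */
static inline vtype bitselect_emul(vtype a, vtype b, vtype c)
{
    return (a & ~c) | (b & c);
}

/* Assumed shape of the macro: dst = bitselect(a, b, c). */
#define vsel(dst, a, b, c) ((dst) = bitselect_emul((a), (b), (c)))

/*
 * The same selection built from XOR and AND only, for targets
 * without a native select instruction.
 */
static inline vtype vsel_gates(vtype a, vtype b, vtype c)
{
    return a ^ ((a ^ b) & c);
}
```

In a kernel, the macro body would call the built-in bitselect() directly; whether that is a net win depends on the S-box expressions it is paired with.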
>> I listed the above in arbitrary order.  Of the above, you may want to
>> start by fixing 6, 7, 3, 4 - maybe in this order.
>> To see how much of the slowness is from the extra copying and such
>> rather than from the inner loop being slow, I tried increasing the
>> iteration count from 25 to 725 and using this test hash:
>>         {"..X8NBuQ4l6uQ", ""},
>> It's not a valid DES-based crypt(3) hash because those use 25 iterations
>> (not configurable), but I generated it for my performance testing anyway.
>> With this, I got roughly twice the speed (after multiplying by 29, since 725 / 25 = 29).
>> Thus, the "overhead" accounts for roughly 50% of the total running time
>> of the current code.  The code remains too slow even with the overhead
>> mostly removed (due to the much higher iteration count like this).
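
[Illustration.] The overhead estimate above can be reproduced with a simple two-term timing model. This is a sketch under stated assumptions, not something from the original code: the time per call is taken to be a fixed overhead plus a compute part that scales linearly with the iteration count, and overhead_fraction is a hypothetical helper name.

```c
/*
 * Model (an assumption, not measured):
 *   T_base   = O + C          time per call at 25 iterations
 *   T_scaled = O + scale * C  time per call at scale * 25 iterations
 * The normalized speedup is r = scale * T_base / T_scaled.
 * Solving for the overhead share f = O / T_base gives
 *   f = scale * (1 - 1/r) / (scale - 1).
 */
static double overhead_fraction(double scale, double speedup)
{
    return scale * (1.0 - 1.0 / speedup) / (scale - 1.0);
}
```

With scale = 29 and a normalized speedup of 2, this evaluates to 29/56, or about 0.52, consistent with the "roughly 50%" figure.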
>> Last but not least, please add proper statements to the source files so
>> that your revisions of them are not misattributed to me.  Right now,
>> some of your modified files say that they've been written by me, which
>> is no longer entirely true.
>> Thanks,
>> Alexander
> I've made some modifications to fix the issues mentioned in items 2, 3, 6, and 7.
> Also, using bitselect is causing a performance drop on the AMD GPU. I'll find out why.
> Regards
> Sayantan

Hi Alexander,

Here's the result of keeping B in local memory.

std2048@...l:~/magnum-jumbo-new/run$ ./john -te -fo=des-opencl -pla=1
OpenCL platform 1: AMD Accelerated Parallel Processing, 2 device(s).
Using device 0: Tahiti
Compilation log: "/tmp/", line 375: warning: goto statement may cause irreducible
          control flow
                        goto finalize_keys;

"/tmp/", line 423: warning: goto statement may cause irreducible
          control flow
                if (rounds_and_swapped == 0x100) goto next;

"/tmp/", line 458: warning: goto statement may cause irreducible
          control flow
                if (--rounds_and_swapped) goto start;

"/tmp/", line 461: warning: goto statement may cause irreducible
          control flow
                if (--iterations) goto swap;

"/tmp/", line 474: warning: goto statement may cause irreducible
          control flow
                goto start;

"/tmp/", line 481: warning: goto statement may cause irreducible
          control flow
                goto body;

Benchmarking: DES BS [128/128 BS XOP-16]... DONE
Many salts:     19942K c/s real, 130023K c/s virtual
Only one salt:  15617K c/s real, 45875K c/s virtual

I haven't posted the patch yet.

