Openwall GNU/*/Linux - a small security-enhanced Linux distro for servers
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date: Wed, 5 Sep 2012 07:22:40 +0400
From: Solar Designer <>
Subject: Re: bitslice DES on GPU

Hi Sayantan,

On Tue, Sep 04, 2012 at 11:04:45PM +0530, Sayantan Datta wrote:
> How do I test the LM hashes for the original cpu implementation ? Is it
> sufficient to set the LM parameter to 1 in the DES_bs_init() function ?

You should be doing whatever LM_fmt.c does.  That's probably not the
kind of answer you wanted, but I'm afraid I don't have a better one.
In short, yes, DES_bs_init() is called with 1 for LM, and proper
functions are used after that point.

Anyhow, I took a look at your code, and I tested it briefly to see where
the bottlenecks are.  Many things are done non-optimally, which I guess
you're aware of:

1. DES_EXPAND = 1 is probably not an optimal choice (although it was/is
OK to try it as well).

2. The expanded keys (and some other fields) should only exist on the
GPU side, yet you have space reserved for them in opencl_DES_bs_combined
on the CPU side as well.  This probably does not directly cost any CPU
time (unless I missed something), but it may result in less optimal use
of caches on the CPU (potentially more cache tag conflicts between
fields of opencl_DES_bs_combined that you do actually use).

3. You're copying too much data to/from GPU.  For example, there's no
point in copying B to GPU, and there's usually no point in copying K
from GPU (although you do need to preserve the partially transposed and
expanded key bits across multiple calls for the same salt somehow).
There are other minor things to exclude from the copying as well.

4. A lot of time is spent on the CPU side.  Some testing I did suggests
that it's around 33% (when run on bull's CPU and HD 7970).  You have
some extra copying on the CPU side as well.

5. DES_bs_cmp_all() became slower than the usual CPU implementation
because it now uses 32-bit rather than 64-bit vector elements.

6. You're using NVIDIA-friendly S-box expressions.  This results in
unneeded overhead when running on AMD GPUs.  Also, you don't have vsel()
defined to use bitselect() - you'll need to fix that when you switch to
proper S-box expressions for AMD.  You'll need to support both of these,
so that we can run reasonable benchmarks on both GPUs.

7. You keep too many of the things in global memory.  With DES_EXPAND = 1,
you have to keep the expanded keys in global memory (but at least
they're read sequentially).  You don't have to keep B and E in global
memory.  BTW, what do you mean by trying to declare only some of the
struct fields __global?  I doubt that this is valid - either the entire
struct is in global memory or the entire struct is in local memory - no?
I'd expect that you'd need to split it into two structs if you have
these two kinds of fields.

I listed the above in arbitrary order.  Of the above, you may want to
start by fixing 6, 7, 3, 4 - maybe in this order.

To see how much of the slowness is from the extra copying and such
rather than from the inner loop being slow, I tried increasing the
iteration count from 25 to 725 and using this test hash:

	{"..X8NBuQ4l6uQ", ""},

It's not a valid DES-based crypt(3) hash because those use 25 iterations
(not configurable), but I generated it for my performance testing anyway.

With this, I got roughly twice higher speed (after multiplying by 29).
Thus, the "overhead" accounts for roughly 50% of the total running time
of the current code.  The code remains too slow even with the overhead
mostly removed (due to the much higher iteration count like this).

Last but not least, please add proper statements to the source files so
that your revisions of them are not misattributed to me.  Right now,
some of your modified files say that they've been written by me, which
is no longer entirely true.



Powered by blists - more mailing lists

Your e-mail address:

Powered by Openwall GNU/*/Linux - Powered by OpenVZ