john-dev - Re: Split kernel for OpenCL WPA-PSK

Message-ID: <feaffe5ff6b415e27b002d28f95c690f@smtp.hushmail.com>
Date: Thu, 8 Nov 2012 19:12:40 +0100
From: magnum <john.magnum@...hmail.com>
To: john-dev@...ts.openwall.com
Subject: Re: Split kernel for OpenCL WPA-PSK

On 7 Nov, 2012, at 18:12 , magnum <john.magnum@...hmail.com> wrote:
> On 7 Nov, 2012, at 17:43 , Lukas Odzioba <lukas.odzioba@...il.com> wrote:
>>> For some reason it segfaults on the Tahiti (but not on AMDAPP/CPU). On all other devices I've tried it works fine. If we can get this straight we should implement similar changes to a bunch of other OpenCL formats that use PBKDF2-HMAC-SHA1.
>> 
>> On Tahiti it segfaults during selftest, testsuite or some real world cracking?
> 
> It segfaults during self-test. The debugger ends up within the amdocl drivers. I suspect it's yet another driver bug. But I have tried it with 12.8 too and it did not help.

I nailed it. Definitely a driver bug but I found a way around it. This is the original code, in the loop kernel, that segfaults:

        for (i = 0; i < 5; i++) {
                W[i] = state[gid].W[i];
                ipad[i] = state[gid].ipad[i];
                opad[i] = state[gid].opad[i];
                out[i] = state[gid].out[i];
        }

I suspect the compiler's optimizer tried to rearrange these loads (a really trivial task) to get coalesced reads from global memory, but screwed up royally. So I did the compiler's job by splitting the loop into one loop per array, and this works fine:

        for (i = 0; i < 5; i++)
                W[i] = state[gid].W[i];
        for (i = 0; i < 5; i++)
                ipad[i] = state[gid].ipad[i];
        for (i = 0; i < 5; i++)
                opad[i] = state[gid].opad[i];
        for (i = 0; i < 5; i++)
                out[i] = state[gid].out[i];
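
The two forms are semantically identical, which is why the split is a safe workaround. A minimal host-side sketch in plain C (not the actual kernel; the `pbkdf2_state` struct layout and the function names are assumptions for illustration, only the four field names come from the code above) that checks both variants load the same values:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical layout: only the four field names come from the kernel. */
typedef struct {
	uint32_t W[5];
	uint32_t ipad[5];
	uint32_t opad[5];
	uint32_t out[5];
} pbkdf2_state;

/* Original form: one loop, four interleaved loads per iteration. */
static void load_fused(const pbkdf2_state *s, uint32_t *W, uint32_t *ipad,
                       uint32_t *opad, uint32_t *out)
{
	int i;

	for (i = 0; i < 5; i++) {
		W[i] = s->W[i];
		ipad[i] = s->ipad[i];
		opad[i] = s->opad[i];
		out[i] = s->out[i];
	}
}

/* Workaround form: one loop per array, so each loop reads one field. */
static void load_split(const pbkdf2_state *s, uint32_t *W, uint32_t *ipad,
                       uint32_t *opad, uint32_t *out)
{
	int i;

	for (i = 0; i < 5; i++)
		W[i] = s->W[i];
	for (i = 0; i < 5; i++)
		ipad[i] = s->ipad[i];
	for (i = 0; i < 5; i++)
		opad[i] = s->opad[i];
	for (i = 0; i < 5; i++)
		out[i] = s->out[i];
}

/* Returns 1 if both forms produce identical private copies, else 0. */
int loads_match(const pbkdf2_state *s)
{
	uint32_t a[4][5], b[4][5];

	load_fused(s, a[0], a[1], a[2], a[3]);
	load_split(s, b[0], b[1], b[2], b[3]);
	return memcmp(a, b, sizeof(a)) == 0;
}
```

This only demonstrates equivalence on the host; it says nothing about what the driver's compiler actually does with either form.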


>>> These are massive changes to both host code and kernel. Some 15-20% boost is gained too btw, and device auto-tuning is implemented.
>> 
>> 15-20% just for nvidia or for amd too?
> 
> I have no idea because of the segfaults but I implemented the usual bitselects, rotate and stuff so if anything, it should be faster. Also, the split kernels reduce register pressure. Some other changes I made released even more registers. On the other hand we depend on global memory between the loop calls. That does not stop office2007 from doing 2.1 billion SHA1/second though, although that one has a smaller global memory footprint than this one.
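
For readers unfamiliar with the "usual bitselects, rotate" trick: the SHA-1 choice and majority functions can each be written as a single select, which OpenCL exposes as `bitselect()`. A rough sketch in plain C, using stand-ins for the OpenCL built-ins (the helper names are hypothetical, and the exact expressions in the committed kernel may differ):

```c
#include <stdint.h>

/* Plain-C stand-in for OpenCL's bitselect(a, b, c):
 * each result bit is taken from b where c is 1, else from a. */
static uint32_t bitselect32(uint32_t a, uint32_t b, uint32_t c)
{
	return (a & ~c) | (b & c);
}

/* Plain-C stand-in for OpenCL's rotate(x, n); valid for 0 < n < 32. */
static uint32_t rotate32(uint32_t x, unsigned n)
{
	return (x << n) | (x >> (32 - n));
}

/* Textbook SHA-1 round functions. */
uint32_t ch_ref(uint32_t x, uint32_t y, uint32_t z)
{
	return (x & y) | (~x & z);
}

uint32_t maj_ref(uint32_t x, uint32_t y, uint32_t z)
{
	return (x & y) | (x & z) | (y & z);
}

/* The same functions expressed as one select each. */
uint32_t ch_sel(uint32_t x, uint32_t y, uint32_t z)
{
	return bitselect32(z, y, x);     /* x picks y, otherwise z */
}

uint32_t maj_sel(uint32_t x, uint32_t y, uint32_t z)
{
	return bitselect32(x, y, z ^ x); /* if x and z differ, y decides */
}
```

On hardware with a native select instruction, each of these can map to fewer operations than the and/or/not form.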

Using device 0: Tahiti
Local worksize (LWS) 192, Global worksize (GWS) 196608
Benchmarking: WPA-PSK PBKDF2-HMAC-SHA-1 [OpenCL]... DONE
Raw:    66197 c/s real, 137970 c/s virtual

OK, the boost is over 50%... but a lot of the boost is merely from the new device auto-tuning. So the kernel is not much faster, but it /is/ faster and my primary goal was just to avoid a performance regression when splitting. BTW, I presume the old code would produce ASIC hangs on the Tahiti sooner or later. No risk of that now.

This code too does over 2.1 billion SHA1/second, but CPU post-processing nearly halves the speed (without OMP). So I'm in the process of moving all of that post-processing to GPU. It's just a couple more HMACs, so I hope to exceed 120K c/s with that in place.
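
As a rough sanity check on these figures: WPA-PSK derives a 256-bit key with 4096 PBKDF2 iterations, which needs two 160-bit output blocks, and each HMAC-SHA1 iteration costs two SHA-1 block operations once the ipad/opad states are precomputed. The sketch below (helper names are hypothetical, and the small per-candidate setup cost is ignored) converts a SHA-1 rate into c/s under those assumptions:

```c
#include <stdint.h>

enum {
	WPA_ITERATIONS = 4096, /* PBKDF2 iteration count used by WPA-PSK */
	WPA_OUT_BLOCKS = 2,    /* 32-byte key needs two 20-byte SHA-1 outputs */
	SHA1_PER_HMAC  = 2     /* inner + outer hash, pads precomputed */
};

/* SHA-1 block operations per candidate password (setup ignored). */
uint64_t sha1_per_candidate(void)
{
	return (uint64_t)WPA_ITERATIONS * WPA_OUT_BLOCKS * SHA1_PER_HMAC;
}

/* Candidates per second achievable at a given SHA-1 rate. */
uint64_t cps_from_sha1_rate(uint64_t sha1_per_second)
{
	return sha1_per_second / sha1_per_candidate();
}
```

That is 16384 SHA-1 operations per candidate, so 2.1 billion SHA-1/second corresponds to roughly 128K c/s on the GPU side. This is consistent with the benchmark landing at about half of that while post-processing still runs on the CPU, and with the 120K c/s target.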

For now, I committed the current code (with CPU post-processing). Please test.

magnum