john-dev - Re: OpenCL kernel max running time vs. "ASIC hang"

Message-ID: <f54d15f21aa317a2e2502e50fc5a388e@smtp.hushmail.com>
Date: Wed, 26 Sep 2012 08:00:11 +0200
From: magnum <john.magnum@...hmail.com>
To: john-dev@...ts.openwall.com
Subject: Re: OpenCL kernel max running time vs. "ASIC hang"

I have implemented this in the office-opencl formats, and will do the same for RAR (that's harder, though). I first targeted 100-200 ms per kernel invocation, and the ASIC hangs disappeared. Performance barely dropped at all, so for better X interactivity I tried going for 10 ms, and performance is *still* virtually the same, while X response is much better.

Specifically, these kernels include 50,000 or 100,000 rounds of SHA-1. I went from one fat 5-second kernel to three kernels, with the middle (heavy) one called like this (1024 rounds per call):

        for (index = 0; index < 50000 / 1024; index++)
                HANDLE_CLERROR(clEnqueueNDRangeKernel(queue[ocl_gpu_id], Hash1k, 1, NULL, &global_work_size, &local_work_size, 0, NULL, NULL), "failed in clEnqueueNDRangeKernel");

And my latest version uses a #define, currently set to 128 rounds per call:

        for (index = 0; index < 50000 / HASH_LOOPS; index++)
                HANDLE_CLERROR(clEnqueueNDRangeKernel(queue[ocl_gpu_id], HashLoop, 1, NULL, &global_work_size, &local_work_size, 0, NULL, NULL), "failed in clEnqueueNDRangeKernel");
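A note on the loop bounds above: with integer division, 50000 / 1024 yields 48 calls (49152 rounds) and 50000 / 128 yields 390 calls (49920 rounds), so the leftover rounds have to be done somewhere else. In the real code that is presumably the job of one of the other two kernels, but that is an assumption. The following is only a minimal sketch of the arithmetic; plan_launches() is a hypothetical helper, not part of JtR, and it avoids OpenCL calls so that it builds without an SDK.

```c
#include <assert.h>

/*
 * Plan how `total` rounds split into launches of at most `chunk` rounds.
 * Returns the number of launches; *last receives the size of the final one.
 * In real host code each launch would be one clEnqueueNDRangeKernel() call
 * (left out here on purpose; this helper is hypothetical).
 */
static unsigned int plan_launches(unsigned int total, unsigned int chunk,
                                  unsigned int *last)
{
        unsigned int full, rest;

        assert(chunk > 0);              /* avoid division by zero */
        full = total / chunk;           /* launches doing a full chunk */
        rest = total % chunk;           /* rounds left over after those */
        *last = rest ? rest : (full ? chunk : 0);
        return full + (rest ? 1 : 0);
}
```

For 50000 rounds this plans 49 launches at 1024 rounds per call (the last one doing 848 rounds) or 391 launches at 128 rounds per call (the last one doing 80).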


I thought this would have a much greater impact on performance than it does. I do not transfer any data between these looped calls; the intermediate data stays in a global array that is not even mapped to host memory. Note, though, that this does not help the responsiveness of JtR itself: it still has to do the whole job in crypt_all() before, e.g., responding to a key press, and this total time is still on the order of seconds. But ASIC hangs are history and you can read your mail while cracking :)
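As a rough cross-check of the timing figures in this message, the duration of one call can be estimated from the total kernel time. This is a sketch under two assumptions: that run time scales linearly with the number of rounds per call, and that the heavy middle kernel accounts for nearly all of the original 5 seconds. ms_per_launch() is a hypothetical helper, not JtR code.

```c
#include <assert.h>

/*
 * Estimate the duration in ms of one kernel launch, assuming run time is
 * proportional to the number of rounds executed (an assumption; launch
 * overhead and the setup/final kernels are ignored).
 */
static double ms_per_launch(double total_ms, unsigned int total_rounds,
                            unsigned int rounds_per_launch)
{
        assert(total_rounds > 0);       /* avoid division by zero */
        return total_ms * rounds_per_launch / total_rounds;
}
```

With 5000 ms for 50000 rounds, this gives about 102 ms per call at 1024 rounds and about 13 ms at 128 rounds, in line with the 100-200 ms and 10 ms targets mentioned above. With the RAR figures from the quoted thread (9 seconds, 256K rounds, 1k rounds per call) it gives about 35 ms.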

BTW, preliminary tests indicate we are now 20% faster than Ivan Golubev's Passcovery when attacking Office 2007 or 2010 formats. At least on Dhiru's ATI 6970. I am the king (for a while :)

magnum


On 26 Jun, 2012, at 6:14 , Bit Weasil <bitweasil@...il.com> wrote:

> There's an overhead to kernel launches, yes.
> 
> It's up to you.  I've not found long kernel launches to be reliable on any platform.  For CUDA, the kernel-killer kicks in unless the box is headless.  For OpenCL with AMD cards, it's unreliable due to ASIC hangs.  I've just not been able to make it work.
> 
> Keep it as an option - I can run long-duration kernels on all my code, by setting the "steps per invocation" to the total number.  I just don't do it most of the time.
> 
> On Mon, Jun 25, 2012 at 7:04 PM, SAYANTAN DATTA <std2048@...il.com> wrote:
> 
> 
> On Tue, Jun 26, 2012 at 4:57 AM, Solar Designer <solar@...nwall.com> wrote:
> On Tue, Jun 26, 2012 at 01:06:08AM +0200, magnum wrote:
> > On 2012-06-26 00:27, Solar Designer wrote:
> > >I discussed this matter with Bit Weasil on IRC a few days ago.
> > >According to him, we shouldn't be trying to spend more than 200 ms per
> > >OpenCL kernel invocation, or we'll face random "ASIC hang" issues on AMD
> [...]
> 
> > That's not an easy goal with slow formats. For RAR, with 256K rounds of
> > SHA-1, I currently don't get much below 2000ms on 7790, and that's with
> > GWS that produces a 40% slower c/s than what we currently use. For best
> > c/s we exceed 9 seconds. Then again, my code is made by a newbie. Making
> > it 10x faster would be nice for sure. But even Milen said his RAR kernel
> > ran for 2-3 seconds a while ago.
> 
> I understand that reducing the amount of parallelism in a kernel
> invocation slows things down, but why not reduce the amount of work per
> kernel invocation by other means - specifically, in your example, why
> not reduce the number of SHA-1 iterations per kernel invocation?  We may
> invoke the kernel more than once from one crypt_all() call,
> sequentially.  For example, the 256k may be achieved by 256 invocations
> of a kernel doing 1k iterations.  This would bring the 9 seconds down to
> 35 ms per kernel invocation.  Perhaps the intermediate results can even
> stay in the GPU between those invocations.
> 
> Have you considered that?
> 
> Alexander
> 
> Wouldn't calling clEnqueueNDRangeKernel too many times cause a performance hit?  What about pausing execution for some time at the user's request? Say we press 'P', which pauses execution so that the user can perform graphics-oriented tasks, and then we resume execution when he is finished.
> 
> Regards,
> Sayantan   
> 
>