john-dev - Re: OpenCL kernel max running time vs. "ASIC hang"

Message-ID: <CALaL1qBS3y2yBY4px8uS18SzzxNqsN8YV-sLZ-ir5pMkKHmoEQ@mail.gmail.com>
Date: Mon, 25 Jun 2012 17:02:44 -0700
From: Bit Weasil <bitweasil@...il.com>
To: Solar Designer <solar@...nwall.com>
Cc: john-dev@...ts.openwall.com
Subject: Re: OpenCL kernel max running time vs. "ASIC hang"

> I understand that reducing the amount of parallelism in a kernel
> invocation slows things down, but why not reduce the amount of work per
> kernel invocation by other means - specifically, in your example, why
> not reduce the number of SHA-1 iterations per kernel invocation?  We may
> invoke the kernel more than once from one crypt_all() call,
> sequentially.  For example, the 256k may be achieved by 256 invocations
> of a kernel doing 1k iterations.  This would bring the 9 seconds down to
> 35 ms per kernel invocation.  Perhaps the intermediate results can even
> stay in the GPU between those invocations.
>

This is what I do with my rainbow table generation (which is, in many
cases, functionally the same as a "slow" kernel).  I take an initial
password, hash/reduce it many times (say, 200,000 for my current tables),
and store the end result.  I do this with tunable kernel execution times
(this is the task that was getting ASIC hangs until I lowered the time per
kernel invocation).

I simply store the intermediate values in GPU global memory.  The access
(if done sanely) is coalesced, which is roughly a "best case" memory
access pattern for both the load and the store.  I use a high-resolution
timer class to dynamically adjust the work done per kernel invocation: if
an invocation takes less than 90% or more than 110% of my target time, I
adjust the steps per invocation for the next call.  It seems to work
nicely, and it also properly handles conditions like an overheating GPU
that throttles, or someone gaming in the background.  The only caution I
have with this approach is that it is possible to get 0 steps per
invocation under certain conditions, and this will not self-correct: the
task then accomplishes nothing.  You need a lower bound on the step count
so that it can recover from a temporary glitch.
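
A minimal host-side sketch of that feedback loop, in Python for brevity.
The names adjust_steps, run_in_slices and launch are hypothetical, and the
proportional rescaling is an assumption: the description above only says
the step count is adjusted when the time leaves the 90%-110% band, not by
how much.  The launch callback stands in for enqueueing a kernel and
waiting for it to finish.

```python
import time


def adjust_steps(steps, elapsed_ms, target_ms, min_steps=1, max_steps=1_000_000):
    """Return the step count to use for the next kernel invocation."""
    if elapsed_ms <= 0:
        # Timer glitch: keep the current value (clamped) instead of dividing by zero.
        return max(min_steps, min(steps, max_steps))
    ratio = elapsed_ms / target_ms
    if 0.9 <= ratio <= 1.1:
        # Inside the tolerance band: leave the step count alone.
        return max(min_steps, min(steps, max_steps))
    # Assumed policy: scale the step count in proportion to the timing error.
    scaled = int(steps * target_ms / elapsed_ms)
    # The lower bound keeps the count from collapsing to 0 and never recovering.
    return max(min_steps, min(scaled, max_steps))


def run_in_slices(total_steps, launch, target_ms, initial_steps=1000):
    """Call launch(first_step, n_steps) until total_steps are done; return the launch count."""
    done = 0
    steps = max(1, initial_steps)
    launches = 0
    while done < total_steps:
        n = min(steps, total_steps - done)
        start = time.perf_counter()
        launch(done, n)  # must block until the work has actually finished
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        done += n
        launches += 1
        steps = adjust_steps(steps, elapsed_ms, target_ms)
    return launches
```

With a real GPU the launch callback has to wait for kernel completion
(or use device-side event timestamps), otherwise the timer only measures
the enqueue call.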

It shouldn't be difficult to take a single-execution kernel and break it
into multiple steps.  If you would like a starting point, the Cryptohaze
tools have this done for all the GPU kernels - feel free to take a look
around.
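
To illustrate the split (this is not the Cryptohaze code), here is a
CPU-side sketch in Python.  It assumes plain iterated SHA-1 as the slow
operation, as in the quoted example; kernel_slice and run_chains are
hypothetical names, and the states list stands in for a buffer that stays
in GPU global memory between invocations.

```python
import hashlib


def kernel_slice(states, n_steps):
    """Stand-in for one kernel launch: advance every chain by n_steps, in place."""
    for i in range(len(states)):
        s = states[i]              # load the intermediate value
        for _ in range(n_steps):
            s = hashlib.sha1(s).digest()
        states[i] = s              # store it back for the next launch


def run_chains(seeds, total_steps, steps_per_launch):
    """Run total_steps iterations per chain using several short launches."""
    if steps_per_launch < 1:
        raise ValueError("steps_per_launch must be at least 1")
    states = [bytes(s) for s in seeds]  # would live in GPU global memory
    done = 0
    while done < total_steps:
        n = min(steps_per_launch, total_steps - done)
        kernel_slice(states, n)
        done += n
    return states
```

Because each launch resumes from the stored state, the result does not
depend on how the iterations are sliced; only the time per launch does.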
