john-dev - Shared find_best_workgroup (was: Memory leak in most OpenCL formats)



Message-ID: <4733b8bf6ba1319c89ff52a5b968a6cb@smtp.hushmail.com>
Date: Wed, 10 Oct 2012 10:20:55 +0200
From: magnum <john.magnum@...hmail.com>
To: john-dev@...ts.openwall.com
Subject: Shared find_best_workgroup (was: Memory leak in most OpenCL formats)

On 9 Oct, 2012, at 13:48 , Claudio André <claudioandre.br@...il.com> wrote:
> On 08-10-2012 19:06, magnum wrote:
>> On 8 Oct, 2012, at 23:35 , Claudio André <claudioandre.br@...il.com> wrote:
>>> On 08-10-2012 18:01, magnum wrote:
>>>> Claudio, this second fix removes the memory leak for your sha2crypt formats too, but they still are somewhat broken since you implemented split kernels, as the shared function can only handle one event and in your case that is currently clEnqueueReadBuffer()... nothing else is measured!
>>> Agree. I will revert to the previous find_best_workgroup approach!
>> You might be able to get away with just this: Change all "profilingEvent" to "NULL" except the two ones (per format) that enqueue crypt_kernel. This will measure the most important kernel so it might do the trick. I just tried it and it seems to work fine.
> 
> Good strategy. I will try it.

For what it's worth, I have now implemented support for split kernels in the shared opencl_find_best_workgroup(). You can now either use profilingEvent for a single kernel, or use firstEvent and lastEvent for the first and last kernels of a split set. Still, this did not give satisfactory results with my formats, so while I committed this support I do not currently use it.

In your case, you'd use firstEvent when enqueueing prepare_kernel and lastEvent when enqueueing final_kernel, and NULL for the looped kernel. In the else clause of your crypt_all, just use profilingEvent. But this would need to be tested on many devices: in my case, it was very beneficial for some GPUs and very detrimental for others. Besides, the tuning takes time, sometimes lots of it. So I opted to stay with semi-fixed LWS.
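
To illustrate the firstEvent/lastEvent idea, here is a minimal sketch of the arithmetic only. It is not the code in the shared function. It assumes the timestamps were already read back with clGetEventProfilingInfo() (CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END) from a queue created with CL_QUEUE_PROFILING_ENABLE, and that the queue is in-order so the looped kernel runs between the two measured ones. The event_times struct and both function names are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical holder for two profiling timestamps, in nanoseconds,
 * as they would be returned by clGetEventProfilingInfo(). */
typedef struct {
    uint64_t start_ns; /* CL_PROFILING_COMMAND_START */
    uint64_t end_ns;   /* CL_PROFILING_COMMAND_END */
} event_times;

/* Single-kernel case (profilingEvent): duration of that one event. */
static uint64_t single_kernel_ns(event_times ev)
{
    assert(ev.end_ns >= ev.start_ns);
    return ev.end_ns - ev.start_ns;
}

/* Split-kernel case (firstEvent + lastEvent): span from the start of the
 * first kernel to the end of the last one. On an in-order queue this span
 * also covers every looped kernel enqueued in between, even though those
 * were enqueued with a NULL event. */
static uint64_t split_kernel_ns(event_times first, event_times last)
{
    assert(last.end_ns >= first.start_ns);
    return last.end_ns - first.start_ns;
}
```

The point of the split variant is that nothing needs to be summed per kernel: only the two outer events are kept, and the difference between them is what gets compared across work group sizes.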

The current find_best function seems to work just fine for most other formats. For my slow kernels, I'm wondering if I should home in on GWS first, using a low LWS (like 32), and then, using that GWS, home in on LWS. I bet that might work better, but it will still take too long. The only way to make it quicker is to run only the loop kernel during this testing and not a full crypt_all(). But we can't easily use that approach in a shared function, since formats have varying requirements for preparation.
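
A rough sketch of the GWS-first, LWS-second idea, just to make the search order concrete. Everything here is hypothetical: bench_fn stands in for whatever runs and times the kernel(s), toy_bench is an invented cost model used only so the example can be run without a device, and doubling is merely one possible step rule. It does nothing about the run time problem, since every step still calls the benchmark.

```c
#include <stddef.h>

/* Hypothetical benchmark hook: returns the seconds needed to process
 * gws candidates with the given local work size. */
typedef double (*bench_fn)(size_t gws, size_t lws, void *ctx);

typedef struct { size_t gws, lws; } tune_result;

/* Phase 1: hold LWS at start_lws and double GWS, keeping the best
 * candidates-per-second rate. Phase 2: hold that GWS and double LWS.
 * Both sizes stay power-of-two multiples of start_lws, so the chosen
 * GWS is always a multiple of the chosen LWS. */
static tune_result tune_gws_then_lws(bench_fn bench, void *ctx,
                                     size_t start_lws, size_t max_lws,
                                     size_t max_gws)
{
    tune_result best = { start_lws, start_lws };
    double best_rate = 0.0;

    for (size_t gws = start_lws; gws <= max_gws; gws *= 2) {
        double t = bench(gws, start_lws, ctx);
        double rate = t > 0.0 ? (double)gws / t : 0.0;
        if (rate > best_rate) { best_rate = rate; best.gws = gws; }
        if (gws > max_gws / 2) break; /* avoid overflow on the doubling */
    }

    for (size_t lws = start_lws * 2;
         lws <= max_lws && lws <= best.gws; lws *= 2) {
        double t = bench(best.gws, lws, ctx);
        double rate = t > 0.0 ? (double)best.gws / t : 0.0;
        if (rate > best_rate) { best_rate = rate; best.lws = lws; }
        if (lws > max_lws / 2) break;
    }
    return best;
}

/* Invented cost model for demonstration only: throughput peaks at
 * GWS 4096 and LWS 128 and falls off on either side. */
static double toy_bench(size_t gws, size_t lws, void *ctx)
{
    (void)ctx;
    double g = gws <= 4096 ? (double)gws : 4096.0 * 4096.0 / (double)gws;
    double l = lws <= 128 ? (double)lws : 128.0 * 128.0 / (double)lws;
    return (double)gws / (g * l);
}
```

With the toy model, tune_gws_then_lws(toy_bench, NULL, 32, 1024, 65536) settles on GWS 4096 and LWS 128. A real device is not guaranteed to behave this nicely: the best LWS may depend on GWS, which is exactly why this ordering would need testing on many devices.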

I'll do some googling (again). Someone ought to have this figured out already.

magnum
