Date: Tue, 06 Mar 2012 23:39:01 +0100
From: Samuele Giovanni Tonon <samu@...uxasylum.net>
To: john-dev@...ts.openwall.com
Subject: Re: OpenCL KPC and LWS

On 03/06/12 20:14, magnum wrote:
> On 02/21/2012 10:59 PM, Samuele Giovanni Tonon wrote:
>>> So the main issue is that auto KPC does not pick a good number. The LWS
>>> fluctuations might be due to normal variations between runs. I should
>>> have recorded the figures for KPC=2M and LWS=64 but I missed that.
>>
>> looks like a chicken-egg problem: when lws is tested i use the default
>> kpc=2M, when LWS is up i use the best LWS i just detected; luksas
>> already reported this kind of problem but i thought we were safe since
>> LWS usually is rather obvious.
> 
> I realise I have a lot to catch up from you guys but here are a couple
> of things that seem to get good and FAST results on my gear, both GPU
> and CPU:
> 
> Have you tried querying clGetKernelWorkGroupInfo() for
> CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE? On the CPUs and GPUs I
> have tried, it seems reliable (better than the current testing) and very
> fast. It was introduced in OpenCL 1.1, so I added a fallback like this:
> 
> 
>   	clEnqueueWriteBuffer(queue_prof, buffer_keys, CL_TRUE, 0,
>   	    (PLAINTEXT_LENGTH) * SSHA_NUM_KEYS, saved_plain, 0, NULL, NULL);
> 
>  +	// This is OpenCL 1.1, we catch CL_INVALID_VALUE and use a fallback
>  +	ret_code = clGetKernelWorkGroupInfo (crypt_kernel, devices[gpu_id],
>  +	    CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
>  +	    sizeof(best_multiple), &best_multiple, NULL);
>  +
>  +	if (ret_code == CL_INVALID_VALUE) {
>  +	    //printf("Can't get preferred LWS multiple, using 1\n");
>  +	    best_multiple = 1;
>  +	} else {
>  +	    HANDLE_CLERROR(ret_code, "Query preferred work group multiple");
>  +	    //printf("preferred multiple: %zu\n", best_multiple);
>  +	}
>  +
>   	// Find minimum time
>  -	for (my_work_group = 1; (int) my_work_group <= (int) max_group_size;
>  +	for (my_work_group = best_multiple; (int) my_work_group <= (int) max_group_size;
>   	    my_work_group *= 2) {

nice one, i'm adding it to the code, thanks!
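
To illustrate what the patch changes in the LWS search, here is a minimal sketch with no OpenCL dependency. The function name lws_candidates() and its parameters are made up for this example; it only reproduces the loop bounds from the diff above, with best_multiple standing in for the value returned by the query (or 1 when the fallback is taken).

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: list the local work sizes the search loop from
 * the patch would time. It starts at best_multiple (1 when the OpenCL
 * 1.1 query is not supported) and doubles until max_group_size is
 * exceeded. Returns the number of candidates written, at most cap.
 */
size_t lws_candidates(size_t best_multiple, size_t max_group_size,
                      size_t *out, size_t cap)
{
	size_t n = 0, lws;

	assert(best_multiple > 0);
	for (lws = best_multiple; lws <= max_group_size && n < cap; lws *= 2)
		out[n++] = lws;
	return n;
}
```

With illustrative values (not measured on any device), a maximum group size of 256 gives nine candidates when starting from 1 and only three (64, 128, 256) when starting from a preferred multiple of 64, so the search is shorter and never tests a size below the multiple.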


> 
> Also, I seem to get good and very fast results with this loop in KPC
> enumeration:
> 
>     for( num=local_work_size; num <= SSHA_NUM_KEYS ; num<<=1)
> 
> 
> Is testing every 16K really of any use? I just see fluctuating numbers
> and a super slow test.

this was the idea i had in mind: by default you get SSHA_NUM_KEYS,
which is the standard value; if you set KPC=0 it means you want a deep
benchmark to find the real best kpc. 16384 seemed a reasonable
tradeoff between a very long but detailed benchmark and just 3-4
tests, which could be misleading.
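
To put rough numbers on that tradeoff, the sketch below counts how many KPC values each scheme would benchmark. The function names are hypothetical, and the linear count assumes the test starts at one step and stops at the maximum, which may not match the actual loop in the format.

```c
#include <assert.h>
#include <stddef.h>

/* Count the values step, 2*step, ... that are <= max_keys (assumed loop shape). */
size_t linear_points(size_t step, size_t max_keys)
{
	assert(step > 0);
	return max_keys / step;
}

/* Count the values start, 2*start, 4*start, ... that are <= max_keys. */
size_t doubling_points(size_t start, size_t max_keys)
{
	size_t n = 0, num;

	assert(start > 0);
	for (num = start; num <= max_keys; num <<= 1)
		n++;
	return n;
}
```

Taking the figures from this thread, a step of 16384 up to 1024*2048*4 gives 512 test points, while doubling from an LWS of 64 up to the same limit gives 18.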

i'm still testing it: on the other formats the step is 4096 and
*_NUM_KEYS is 1024*2048, but on nsldaps i found that 1024*2048*4 gave
higher speed with high-end cards, so i decided to keep a high value and
increase the step so as not to die of boredom (as already happens).
i'm still looking for the best step size: while doubling seems good
for a quick benchmark, i still think many cards find their best kpc
between 1024*1024 and 1024*1024*2, values that doubling misses.
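
One possible compromise, sketched here only as an idea and not as what the current code does, is a two-pass search: a quick doubling pass to find the best power-of-two region, then a linear pass with a fixed step between the neighbouring doubling points. The names find_kpc(), kpc_bench_fn and toy_bench() are invented for the example; toy_bench() is a stand-in for a real timing run and simply peaks at 1536*1024.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback: returns a speed figure (higher is better) for a KPC. */
typedef double (*kpc_bench_fn)(size_t kpc);

/* Toy model for demonstration only: speed peaks at kpc = 1536*1024. */
double toy_bench(size_t kpc)
{
	double d = (double)kpc - 1536.0 * 1024.0;

	return d < 0 ? d : -d;
}

size_t find_kpc(kpc_bench_fn bench, size_t start, size_t max_keys, size_t step)
{
	size_t best = start, num, lo, hi;
	double best_speed, s;

	assert(bench && start > 0 && step > 0 && start <= max_keys);

	/* Pass 1: coarse search, doubling from start up to max_keys. */
	best_speed = bench(start);
	for (num = start << 1; num <= max_keys; num <<= 1) {
		s = bench(num);
		if (s > best_speed) {
			best_speed = s;
			best = num;
		}
	}

	/* Pass 2: linear search between the neighbours of the coarse winner. */
	lo = best / 2 > start ? best / 2 : start;
	hi = best <= max_keys / 2 ? best * 2 : max_keys;
	for (num = lo; num <= hi; num += step) {
		s = bench(num);
		if (s > best_speed) {
			best_speed = s;
			best = num;
		}
	}
	return best;
}
```

Called as find_kpc(toy_bench, 64, 1024 * 2048, 16384), the doubling pass alone would settle on 1024*1024, and the second pass then finds the peak that lies between 1024*1024 and 1024*1024*2. Real timings fluctuate between runs, so an actual implementation would probably need to repeat or average measurements.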

Samuele

