john-dev - Re: [GSoC] John the Ripper support for PHC finalists

Message-ID: <CAKGDhHVMfonhrZ5fxVA2w2hZ3zWGsbyFkHycQXQmiq8L=dQ4TQ@mail.gmail.com>
Date: Fri, 8 May 2015 21:25:16 +0200
From: Agnieszka Bielec <bielecagnieszka8@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: [GSoC] John the Ripper support for PHC finalists

2015-05-07 16:09 GMT+02:00 Solar Designer <solar@...nwall.com>:
> Agnieszka,
>
> On Thu, May 07, 2015 at 03:30:43PM +0200, Agnieszka Bielec wrote:
>> 2015-05-05 20:00 GMT+02:00 Solar Designer <solar@...nwall.com>:
>> > On Mon, May 04, 2015 at 01:18:46AM +0200, Agnieszka Bielec wrote:
>> >> 2015-04-27 3:50 GMT+02:00 Solar Designer <solar@...nwall.com>:
>> >>
>> >> > BTW, bumping into total GPU global memory size may be realistic with
>> >> > these memory-hard hashes.  Our TITAN's 6 GB was the performance
>> >> > limiting factor in some of the benchmarks here:
>> >> > http://www.openwall.com/lists/crypt-dev/2014/03/13/1
>> >>
>> >> I use only 128MB
>> >
>> > What happens if you increase GWS further?  Does performance drop?  What
>> > if you manually increase GWS even further?  It might happen that the
>> > auto-tuning finds a local minimum, whereas a higher GWS is optimal.
>>
>> the speed drops significantly when I make gws x2 bigger
>
> Can you try making it bigger yet anyway?  This probably won't help, but
> it may be worth trying.
>

I tested 2x, 4x and 8x; the bigger the GWS, the worse the results.
At 16x I get CL_INVALID_BUFFER_SIZE and
CL_MEM_OBJECT_ALLOCATION_FAILURE errors.
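
One likely explanation for the 16x failure: clCreateBuffer reports
CL_INVALID_BUFFER_SIZE when the requested size is 0 or exceeds the device's
CL_DEVICE_MAX_MEM_ALLOC_SIZE. The sketch below is only an illustration, not
code from the JtR tree. The helper names are hypothetical, and it assumes a
single state buffer of gws * bytes_per_item bytes, with max_alloc obtained
elsewhere from clGetDeviceInfo().

```c
#include <stdint.h>

/*
 * Hypothetical helper: largest GWS whose per-work-item state fits in one
 * buffer allocation. max_alloc is assumed to be the value reported for
 * CL_DEVICE_MAX_MEM_ALLOC_SIZE; bytes_per_item is the state size that a
 * single work-item needs.
 */
static uint64_t max_gws_for_alloc(uint64_t max_alloc, uint64_t bytes_per_item)
{
    if (bytes_per_item == 0)
        return 0;               /* nothing sensible to compute */
    return max_alloc / bytes_per_item;
}

/* Returns 1 if gws work-items fit in a single buffer, 0 otherwise. */
static int gws_fits(uint64_t gws, uint64_t max_alloc, uint64_t bytes_per_item)
{
    return gws <= max_gws_for_alloc(max_alloc, bytes_per_item);
}
```

Capping manual GWS values with a check like this would turn the allocation
errors into an early, explicit limit.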

>
>> > Also, I notice there are some if/else in G and H macros.  Are they
>> > removed during loop unrolling, or do they translate to exec masks in the
>> > generated code?
>>
>> I cached values from memory into variables and I must check if
>> i0==index_global and i0==index_local, it's faster with this. In F all
>> workitems execute the same if-else branch but not in H. I didn't
>> disassemble the code yet. I doubt
>
> I don't understand.
> What exactly have you cached?
>
> Do you expect the "i0==index_local" and "i0==index_global" conditions to
> often be true, or are these rare special cases?

They are rare special cases.

> I'd expect the latter, but I don't see the purpose.

The original code looks like this:
S[i0] = some operations
S[x] = some operations on S[i0]
and again S[i0] = some operations
and so on.

I cache the values in variables (v, v1) at the beginning of function F and
write them back at the end of the H or G functions. I have to use a
different set of instructions when the address of the first value is equal
to that of the second one.
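
The caching idea described above can be sketched in plain C. This is a
simplified illustration, not the actual POMELO kernel: the arithmetic is a
placeholder for "some operations", only one cached value is shown instead of
(v, v1), and the function names are made up for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Reference version: every update goes straight to the array. */
static void step_direct(uint64_t *S, size_t i0, size_t x)
{
    S[i0] = S[i0] * 3 + 1;      /* placeholder for "some operations" */
    S[x] += S[i0];              /* update that depends on S[i0] */
    S[i0] ^= S[i0] >> 7;        /* S[i0] is modified again */
}

/*
 * Cached version: S[i0] is held in v and written back once at the end.
 * When x == i0 the "other" element is the cached one, so the update must
 * be applied to v; otherwise the final write-back would overwrite it.
 */
static void step_cached(uint64_t *S, size_t i0, size_t x)
{
    uint64_t v = S[i0];

    v = v * 3 + 1;
    if (x == i0)
        v += v;                 /* aliasing special case */
    else
        S[x] += v;              /* common case: distinct element */
    v ^= v >> 7;

    S[i0] = v;                  /* single write-back */
}
```

Both functions leave the array in the same state for any i0 and x, which is
why the extra comparison is needed even though the aliasing case is rare.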

>
>> >> and the gws number with the memory usage were the same, I can nothing
>> >> to do with this bottleneck
>> >>
>> >> but If I remove everything from the code, GWS also doesn't differ
>> >
>> > "Everything"?
>>
>> if I change my function into pomelo_crypt_kernel(args...) { nothing  }
>> but sorry, this was a false positive, If i set manually gws in this
>> case everything looks normal
>
> Does this suggest that GWS auto-tuning does not work correctly?

Maybe, but this is a special case. I may also have set the auto-tune
variables wrongly for the "empty" crypt_kernel.
It happens both when I allocate the buffer as usual and when I remove
it, but I didn't change the auto-tune options.

>
>> > AMD GCN (dev=0 and dev=1 in super) has 64 KB of local memory per CU.
>> > See http://developer.amd.com/wordpress/media/2013/06/2620_final.pdf
>> > slide 10.
>>
>> I checked local memory size using this code
>>
>> clGetDeviceInfo(devices[gpu_id],CL_DEVICE_LOCAL_MEM_SIZE,sizeof(cl_ulong),&local_memory_size,NULL);
>>     printf("mamy %llu\n",(unsigned long long) local_memory_size);
>>
>> and I was getting 48 and 32 KB
> Which devices do these correspond to?

32768 for --dev=1 and 49152 for --dev=5