Date: Thu, 7 May 2015 15:30:43 +0200
From: Agnieszka Bielec <>
Subject: Re: [GSoC] John the Ripper support for PHC finalists

2015-05-05 20:00 GMT+02:00 Solar Designer <>:
> On Mon, May 04, 2015 at 01:18:46AM +0200, Agnieszka Bielec wrote:
>> 2015-04-27 3:50 GMT+02:00 Solar Designer <>:
>> > BTW, bumping into total GPU global memory size may be realistic with
>> > these memory-hard hashes.  Our TITAN's 6 GB was the performance
>> > limiting factor in some of the benchmarks here:
>> >
>> I use only 128MB
> What happens if you increase GWS further?  Does performance drop?  What
> if you manually increase GWS even further?  It might happen that the
> auto-tuning finds a local minimum, whereas a higher GWS is optimal.

the speed drops significantly when I make GWS 2x bigger

>> for the experiments I removed almost everything from my code except
>> the biggest bottleneck:
> That's a good experiment.
>>         v1=vload4(0,S+sMAP(index_local));       \
>>         v= v+(v1<<1); \
>>         v1=v1+(v<<2);             \
>>         vstore4(v1,0,S+sMAP(index_local));               \
>> \
>>         random_number  = S[sMAP(i3)];              \
> BTW, can you explain why sMAP is as it is?
> #define sMAP(X) ((X)*GID+gid4)
> where:
>         gid = get_global_id(0);
>         GID = get_global_size(0);
>         gid4 = gid * 4;

sMAP, MAP and cMAP stand for coalescing (IIRC it's around 5 times
(maybe more) faster with it)

> Also, I notice there are some if/else in G and H macros.  Are they
> removed during loop unrolling, or do they translate to exec masks in the
> generated code?

I cached values from memory into variables, and I must check whether
i0==index_global and i0==index_local; it's faster this way. In F all
work-items execute the same if-else branch, but not in H. I haven't
disassembled the generated code yet, but I doubt they are removed.

>> and the GWS number with the memory usage were the same; there's
>> nothing I can do about this bottleneck,
>> but if I remove everything from the code, GWS also doesn't differ
> "Everything"?

if I change my function into pomelo_crypt_kernel(args...) { nothing } -
but sorry, this was a false positive: if I set GWS manually in this
case, everything looks normal

> Yes, caching the index_local portion of S[] in local memory (or
> you can also try private memory on NVIDIA) makes sense to me.  A drawback
> is that for all writes to S[], you'd have to check if the index is low
> enough that the write needs to go to this cached copy (as well as
> possibly to the global copy, to avoid having to perform a similar check
> on global_index reads, or you can use the cache there as well - it's
> unclear which will run faster).
>> we could cache this segment, but today's graphics cards
>> rarely have 64KB of local memory; on super we have 48KB and 32KB.
>> it is even worse because this is 64KB per work-group.
>> we don't know how much __private memory we have; we can only see
>> whether the kernel compilation failed or not, but I'm not sure of this
> I think you should generally prefer to use local rather than private
> memory for this.

I think so too

> AMD GCN (dev=0 and dev=1 in super) has 64 KB of local memory per CU.
> See
> slide 10.

I checked the local memory size using this code:

    printf("we have %llu\n",(unsigned long long) local_memory_size);

and I was getting 48 and 32 KB

