Date: Tue, 5 May 2015 21:00:07 +0300 From: Solar Designer <solar@...nwall.com> To: john-dev@...ts.openwall.com Subject: Re: [GSoC] John the Ripper support for PHC finalists On Mon, May 04, 2015 at 01:18:46AM +0200, Agnieszka Bielec wrote: > 2015-04-27 3:50 GMT+02:00 Solar Designer <solar@...nwall.com>: > > > BTW, bumping into total GPU global memory size may be realistic with > > these memory-hard hashes. Our TITAN's 6 GB was the performance > > limiting factor in some of the benchmarks here: > > http://www.openwall.com/lists/crypt-dev/2014/03/13/1 > > I use only 128MB What happens if you increase GWS further? Does performance drop? What if you manually increase GWS even further? It might happen that the auto-tuning finds a local minimum, whereas a higher GWS is optimal. > for the experimenst i removed almost everything from my code except > the biggest bottleneck: That's a good experiment. > v1=vload4(0,S+sMAP(index_local)); \ > v= v+(v1<<1); \ > v1=v1+(v<<2); \ > vstore4(v1,0,S+sMAP(index_local)); \ > \ > random_number = S[sMAP(i3)]; \ BTW, can you explain why sMAP is as it is? #define sMAP(X) ((X)*GID+gid4) where: gid = get_global_id(0); GID = get_global_size(0); gid4 = gid * 4; Also, I notice there are some if/else in G and H macros. Are they removed during loop unrolling, or do they translate to exec masks in the generated code? > and the gws number with the memory usage were the same, I can nothing > to do with this bottleneck > > but If I remove everything from the code, GWS also doesn't differ "Everything"? You should try tuning GWS manually. IIRC, we support invocation like "GWS=... ./john ..." > another thing > > index_local = (((i + j) >> 2) - 0x1000 + (random_number & 0x1fff)) > & mask; \ > > 0x1fff*8 = 64KB, You're forgetting the "& mask". For low m_cost, mask is smaller than 0x1fff. Yes, caching the index_local portion of S in local memory (or you can also try private memory on NVIDIA) makes sense to me. A drawback is that for all writes to S, you'd have to check if the index is low enough that the write needs to go to this cached copy (as well as possibly to the global copy, to avoid having to perform a similar check on global_index reads, or you can use the cache there as well - it's unclear which will run faster). > we could cache this segment but today graphic cards > rarely has 64KB of local memory, on super we have 48kB and 32KB > it is even worse because this is 64KB for the work-group > > we don't know how much the __private memory we have , we can only see > if the kernel compilation failed or not, but I'm not sure of this I think you should generally prefer to use local rather than private memory for this. AMD GCN (dev=0 and dev=1 in super) has 64 KB of local memory per CU. See http://developer.amd.com/wordpress/media/2013/06/2620_final.pdf slide 10. I think OpenCL private (not local) memory maps to the register file on AMD devices. We've got 256 KB of register file memory per CU (so 4x more than LDS), but there's a limit of 256 VGPRs per work-item, so you shouldn't be able to fit a large array in there (at least without custom GPU firmware). Were you able to use much private memory for POMELO on an AMD GCN card? I'd be surprised if so. Oh, I see your "private dev=1" table in tests.ods is only filled for m_cost=0. Was this like 128 x 512-bit VGPRs, for 8 KB? Please try local memory instead. It should let you do m_cost=2 as well. For low m_cost where either works, it does make sense to compare speeds for private vs. local. Thanks, Alexander
Powered by blists - more mailing lists
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.