Date: Tue, 5 May 2015 21:00:07 +0300
From: Solar Designer <>
Subject: Re: [GSoC] John the Ripper support for PHC finalists

On Mon, May 04, 2015 at 01:18:46AM +0200, Agnieszka Bielec wrote:
> 2015-04-27 3:50 GMT+02:00 Solar Designer <>:
> > BTW, bumping into total GPU global memory size may be realistic with
> > these memory-hard hashes.  Our TITAN's 6 GB was the performance
> > limiting factor in some of the benchmarks here:
> >
> I use only 128MB

What happens if you increase GWS further?  Does performance drop?  What
if you manually increase GWS even further?  It might happen that the
auto-tuning finds a local minimum, whereas a higher GWS is optimal.

> for the experiments I removed almost everything from my code except
> the biggest bottleneck:

That's a good experiment.

>         v1=vload4(0,S+sMAP(index_local));       \
>         v= v+(v1<<1); \
>         v1=v1+(v<<2);             \
>         vstore4(v1,0,S+sMAP(index_local));               \
> \
>         random_number  = S[sMAP(i3)];              \

BTW, can you explain why sMAP is as it is?

#define sMAP(X) ((X)*GID+gid4)


        gid = get_global_id(0);
        GID = get_global_size(0);
        gid4 = gid * 4;

Also, I notice there are some if/else in G and H macros.  Are they
removed during loop unrolling, or do they translate to exec masks in the
generated code?

> and the GWS number and the memory usage were the same; I can't do
> anything about this bottleneck,
> but if I remove everything from the code, GWS also doesn't differ


You should try tuning GWS manually.  IIRC, we support invocation like
"GWS=... ./john ..."
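A sketch of such a manual sweep, assuming the GWS=... environment override works as described above; 'echo' stands in for the real ./john invocation so the loop is runnable anywhere, and the format name is a placeholder:

```shell
# Hypothetical manual GWS sweep: drop the 'echo' to actually benchmark.
for gws in 1024 2048 4096 8192 16384 32768; do
    echo "GWS=$gws ./john --test --format=<your-opencl-format>"
done
```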

> another thing
> index_local = (((i + j) >> 2) - 0x1000 + (random_number & 0x1fff)) & mask; \
> 0x1fff*8 = 64KB,

You're forgetting the "& mask".  For low m_cost, mask is smaller than
0x1fff.  Yes, caching the index_local portion of S[] in local memory (or
you can also try private memory on NVIDIA) makes sense to me.  A drawback
is that for all writes to S[], you'd have to check if the index is low
enough that the write needs to go to this cached copy (as well as
possibly to the global copy, to avoid having to perform a similar check
on global_index reads, or you can use the cache there as well - it's
unclear which will run faster).

> we could cache this segment, but today's graphics cards rarely have
> 64 KB of local memory; on super we have 48 KB and 32 KB.
> It is even worse because this is 64 KB per work-group.
> We don't know how much __private memory we have; we can only see
> whether the kernel compilation failed or not, but I'm not sure of this.

I think you should generally prefer to use local rather than private
memory for this.

AMD GCN (dev=0 and dev=1 in super) has 64 KB of local memory (LDS) per
CU; see slide 10.

I think OpenCL private (not local) memory maps to the register file on
AMD devices.  We've got 256 KB of register file memory per CU (so 4x
more than LDS), but there's a limit of 256 VGPRs per work-item, so you
shouldn't be able to fit a large array in there (at least without custom
GPU firmware).  Were you able to use much private memory for POMELO on
an AMD GCN card?  I'd be surprised if so.

Oh, I see your "private dev=1" table in tests.ods is only filled for
m_cost=0.  Was this like 128 x 512-bit VGPRs, for 8 KB?  Please try
local memory instead.  It should let you do m_cost=2 as well.

For low m_cost where either works, it does make sense to compare speeds
for private vs. local.


