Date: Sat, 29 Aug 2015 07:47:07 +0300
From: Solar Designer <>
Subject: Re: PHC: Argon2 on GPU


On Mon, Aug 24, 2015 at 03:42:30PM +0200, Agnieszka Bielec wrote:
> 2015-08-24 4:28 GMT+02:00 Solar Designer <>:
> > On Mon, Aug 24, 2015 at 01:52:35AM +0200, Agnieszka Bielec wrote:
> >> 2015-08-23 8:15 GMT+02:00 Solar Designer <>:
> >> > While private memory might be larger and faster on specific devices, I
> >> > think that not making any use of local memory is wasteful.  By using
> >> > both private and local memory at once, we should be able to optimally
> >> > pack more concurrent Argon2 instances per GPU and thereby hide more of
> >> > the various latencies.
> >>
> >> why will we pack more argon2 per gpu using both types of memory?
> >> I'm using only very small portions of private memory.
> >
> > You're using several kilobytes per instance - that's not very small.
> >
> > If not this, then what is limiting the number of concurrent instances
> > when we're not yet bumping into total global memory size?  For some of
> > the currently optimal LWS/GWS settings, we're nearly bumping into the
> > global memory size, but for some (across the different GPUs, as well as
> > 2i vs. 2d) we are not.  And even when we are, maybe a higher LWS would
> > improve performance when we can afford it.
> the second option is that we reached the point when after increasing
> gws number, we can't get more access to global memory and most of
> work-items are waiting for memory.
> argon2i is coalesced and it can run using more gws than argon2d,

If we calculate our global memory bandwidth usage from the observed c/s
figures, it's easy to see it's still several times lower than what these
GPUs have available - and as you point out, with 2i's coalescing we
should actually be able to use the full bandwidth.  (With 2d, we might
not be because we need accesses narrower than what's optimal for the

However, you're right - we might be bumping into the memory bus in the
current code anyway.  Maybe the coalescing isn't good enough.  Or maybe
we get too many spills to global memory - we could review the generated
code to see where they are and where they go to, or just get rid of them
without such analysis if we can.
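
The bandwidth estimate mentioned above can be sketched roughly as
follows.  This is only an illustration: the function names are made up,
and the factor of 3 accesses per 1 KiB block (two block reads plus one
block write) is an assumption about the kernel, not something measured.
Spills to global memory would add traffic that this does not count.

```python
# Rough estimate of the global memory traffic implied by an observed
# c/s rate.  All names and default values here are illustrative.

def argon2_bandwidth_gbps(cps, m_cost_kib, t_cost, accesses_per_block=3):
    """Return estimated global memory traffic in GB/s (10^9 bytes/s).

    accesses_per_block is an ASSUMPTION: 2 block reads (previous and
    reference block) plus 1 block write per 1 KiB block filled.  Use 2
    if the previous block is kept in registers instead of being re-read.
    """
    bytes_per_hash = m_cost_kib * 1024 * t_cost * accesses_per_block
    return cps * bytes_per_hash / 1e9


def utilization(cps, m_cost_kib, t_cost, peak_gbps, accesses_per_block=3):
    """Fraction of the device's peak bandwidth this traffic would use."""
    used = argon2_bandwidth_gbps(cps, m_cost_kib, t_cost, accesses_per_block)
    return used / peak_gbps
```

Plugging in the observed c/s, the m_cost and t_cost used for the
benchmark, and the vendor's quoted peak bandwidth gives a fraction that
can be compared against 1.0.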

I've just briefly tried running your 2i and 2d kernels from your main
branch (not the vector8 stuff) on Titan X - and the results are
disappointing.  Performance is similar to what we saw on the old Titan,
whereas the expectation was it'd be a multiple of what we saw on your
960M.  Can you please experiment with this too, and try to use LWS and
GWS settings directly scaled from those that you find perform well on
your 960M (perhaps it means same LWS, but ~4.8x larger GWS)?  In your
most recent full set of benchmark results, you didn't include the
auto-tuning output (no -v=4), so I don't know what LWS and GWS you were
using in the 960M benchmarks.
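
As a sketch of the scaling suggested above (same LWS, GWS multiplied by
~4.8), assuming GWS has to stay a multiple of LWS and that each instance
needs m_cost KiB of global memory.  The helper names are hypothetical,
and any numbers passed in are placeholders, since the actual 960M
settings are not known here.

```python
# Scale a GWS found on a smaller GPU to a larger one, and cap it by the
# global memory available.  Helper names are hypothetical.

def scale_gws(gws_small, lws, factor):
    """Scale GWS by `factor`, rounding down to a multiple of LWS."""
    scaled = int(round(gws_small * factor))
    return max(lws, scaled - scaled % lws)


def max_gws_for_memory(global_mem_bytes, m_cost_kib, lws):
    """Largest multiple of LWS whose Argon2 memory fits in global memory.

    ASSUMPTION: memory use is exactly m_cost KiB per instance; in
    practice not all of the device memory is available for allocation.
    """
    per_instance = m_cost_kib * 1024
    fit = global_mem_bytes // per_instance
    return fit - fit % lws


def scaled_gws_capped(gws_small, lws, factor, global_mem_bytes, m_cost_kib):
    """Scaled GWS, limited to what fits in global memory."""
    return min(scale_gws(gws_small, lws, factor),
               max_gws_for_memory(global_mem_bytes, m_cost_kib, lws))
```

For example, scale_gws(1024, 64, 4.8) rounds 4915.2 down to 4864, the
nearest lower multiple of 64.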

Like I said, my initial results are not good, and I did try a few LWS
and GWS combinations (up to using nearly the full 12 GB memory even).
So I don't expect you would succeed either, but I'd like us to have a
direct comparison of 960M vs. Titan X anyway, so that we can try to
figure out what the bottleneck in scaling Argon2 between these two GPUs
is.  And the next task might be to deal with the register spilling.

If things just don't fit into private memory, then we might prefer to
explicitly move some data into local and/or global memory rather than
leave this up to the compiler and keep guessing what's going on.  For a
start, we need to achieve the same performance as we do now, but without
spills and with explicit use of other memory types.  And after that
point, we could proceed to optimize our use of the different memory
types.
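
To illustrate why splitting per-instance state between private and
local memory could allow more concurrent instances, here is a
simplified model.  The budgets and per-instance sizes below are
hypothetical and not taken from any particular GPU, and real hardware
allocates registers and local memory with granularities that this
ignores.

```python
# Simplified occupancy model: how many instances fit on one compute
# unit given separate private (register) and local memory budgets.
# All sizes are in bytes and are hypothetical.

def instances_per_cu(private_budget, local_budget,
                     private_per_inst, local_per_inst):
    """Concurrent instances per compute unit under both memory limits."""
    limits = []
    if private_per_inst > 0:
        limits.append(private_budget // private_per_inst)
    if local_per_inst > 0:
        limits.append(local_budget // local_per_inst)
    return min(limits) if limits else 0


PRIVATE = 256 * 1024   # hypothetical register file size per compute unit
LOCAL = 64 * 1024      # hypothetical local memory size per compute unit

# 5 KiB of state per instance, kept entirely in private memory
all_private = instances_per_cu(PRIVATE, LOCAL, 5 * 1024, 0)

# same 5 KiB, but with 1 KiB explicitly moved to local memory
split = instances_per_cu(PRIVATE, LOCAL, 4 * 1024, 1 * 1024)
```

With these made-up numbers, the all-private layout fits 51 instances
per compute unit and the split layout fits 64, because the local memory
that was sitting idle now carries part of the state.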

Titan X is -dev=4 on super now, although I expect this device number
will change once we re-introduce the HD 7990 card.


