john-dev - Re: GTX TITAN (was: new dev box wishes)

Message-ID: <CABh=JRGBx80WOm+Aj=PV8Kq=kNCFSJMrAotdJFkMirRyB3NJZA@mail.gmail.com>
Date: Mon, 16 Sep 2013 00:22:47 +0300
From: Milen Rangelov <gat3way@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: GTX TITAN (was: new dev box wishes)

That's a bit strange.

I have almost never seen any benefit from vectorizing on 7970 (well, except
in one or two cases).

There are occasions where the same code, vectorized, gives better
performance on 7970, but if you transform the scalar code so that
more work is done in the kernel (e.g. by doing 4 consecutive
operations in a loop instead of using 4x vectors), you'd eventually come
up with a faster scalar solution given the same amount of work (global
work size / vector size). If the global work size is the same, the
vectorized solution may seem faster just because overall you have more
kernel launches per second with the scalar code than with the vector code,
and kernel launch latency and host-device transfers then come into play. I
think AMD APP Profiler can be very helpful for figuring out what's happening
in such cases.
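
To illustrate the work accounting above, here is a rough host-side model
in Python. It is only a sketch, not OpenCL and not JtR code: mix() is a
made-up stand-in for the per-candidate work, and the launch overhead and
per-candidate cost passed to modeled_time() are assumed numbers, not
measurements from a 7970. It shows that a scalar work-item looping over
4 consecutive candidates covers the same candidates as a 4-wide vector
work-item, and that a plain scalar kernel at the same global work size
needs 4x the launches for the same keyspace.

```python
# Toy host-side model; every name here is hypothetical.
MASK32 = 0xFFFFFFFF


def mix(x):
    """Stand-in for the per-candidate work (not a real hash)."""
    x = (x * 0x9E3779B1) & MASK32
    return x ^ (x >> 15)


def vector_kernel(gid, width=4):
    """One work-item handling `width` candidates, like a uint4 kernel."""
    lanes = [gid * width + i for i in range(width)]
    return [mix(v) for v in lanes]  # one "vector" operation over all lanes


def scalar_loop_kernel(gid, iters=4):
    """One work-item handling `iters` consecutive candidates in a loop."""
    out = []
    for i in range(iters):
        out.append(mix(gid * iters + i))
    return out


def run(kernel, gws):
    """Emulate a single launch: call the kernel once per work-item."""
    results = []
    for gid in range(gws):
        results.extend(kernel(gid))
    return results


def launches_needed(candidates, gws, per_item):
    """Launches required to cover `candidates` (ceiling division)."""
    per_launch = gws * per_item
    return -(-candidates // per_launch)


def modeled_time(candidates, gws, per_item, launch_overhead, per_candidate):
    """Fixed cost per launch plus compute cost; both costs are assumptions."""
    launches = launches_needed(candidates, gws, per_item)
    return launches * launch_overhead + candidates * per_candidate


if __name__ == "__main__":
    gws = 1 << 16
    keyspace = 1 << 20
    # Same global work size, same results, same number of launches.
    print(run(vector_kernel, 8) == run(scalar_loop_kernel, 8))
    # Plain scalar (1 candidate per work-item) vs. 4 per work-item.
    print(launches_needed(keyspace, gws, 1), launches_needed(keyspace, gws, 4))
    print(modeled_time(keyspace, gws, 1, 100, 1),
          modeled_time(keyspace, gws, 4, 100, 1))
```

In this model the per-candidate cost is identical for all variants, so any
difference in modeled_time() comes purely from the number of launches.
That is the effect that can make a vectorized kernel look faster than it
really is when the global work size is held constant.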

Regards,
Milen

On Sun, Sep 15, 2013 at 11:08 PM, Solar Designer <solar@...nwall.com> wrote:

> On Sun, Sep 15, 2013 at 09:12:27AM -0700, Alain Espinosa wrote:
> > On 9/14/13, Solar Designer <solar@...nwall.com> wrote:
> > > I guess these results may teach us something about optimization for
> > > this GPU (and other Kepler GPUs?) - four-element vectors or(/and?)
> > > interleaving of independent instructions give best results.
> >
> > For a GT 630 (compute capability 2.1) using a vector of 3 elements for
> > NTLM hashing gives a ~15-20% performance increase compared with a
> > vector of 1 element. I think vectors of 2-3 elements are best because
> > they reduce the number of registers while providing sufficient
> > parallelism, but I have not tested this assertion on a 3.5 GPU.
>
> FWIW, I read yesterday that TITAN allows for 4x more registers per
> thread than other GTX 7xx GPUs do.  With CUDA, we can probably simulate
> either behavior (for tuning of our code for other GTX 7xx as well) by
> adjusting the target arch, but I don't know if we can do that with
> NVIDIA's OpenCL too (without having an actual lower-than-TITAN GTX 7xx
> card).  (And we're almost exclusively using OpenCL now, with our CUDA
> code mostly abandoned - it works, but it lacks auto-tuning and we're not
> optimizing it further.)
>
> Alexander
>
