Date: Sun, 23 Aug 2015 23:05:30 +0200
From: magnum <john.magnum@...hmail.com>
To: john-dev@...ts.openwall.com
Subject: Re: interleaving on GPUs

On 2015-08-23 07:08, Solar Designer wrote:
> I just read this about NVIDIA's Kepler (such as the old GTX TITAN that
> we have in super):
>
> http://docs.nvidia.com/cuda/kepler-tuning-guide/index.html#device-utilization-and-occupancy
>
> "Also note that Kepler GPUs can utilize ILP in place of
> thread/warp-level parallelism (TLP) more readily than Fermi GPUs can.
> Furthermore, some degree of ILP in conjunction with TLP is required by
> Kepler GPUs in order to approach peak single-precision performance,
> since SMX's warp scheduler issues one or two independent instructions
> from each of four warps per clock.  ILP can be increased by means of, for
> example, processing several data items concurrently per thread or
> unrolling loops in the device code, though note that either of these
> approaches may also increase register pressure."
>
> Note that they explicitly mention "processing several data items
> concurrently per thread".  So it appears that when targeting Kepler, up
> to 2x interleaving at OpenCL kernel source level could make sense.
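
For concreteness, such source-level 2x interleaving would look
something like this toy kernel (made up for illustration, not one of
our formats): each work-item carries two independent data items, so
the SMX scheduler has two dependency chains it can dual-issue:

__kernel void toy_2x(__global const uint *in, __global uint *out)
{
    uint gid = get_global_id(0);
    uint a = in[2 * gid];       /* chain A */
    uint b = in[2 * gid + 1];   /* chain B */

    for (int i = 0; i < 64; i++) {
        /* A and B are independent, so the compiler and the warp
           scheduler are free to interleave the two chains (ILP) */
        a = (a << 5) + a + 0x9e3779b9U;
        b = (b << 5) + b + 0x9e3779b9U;
    }

    out[2 * gid] = a;
    out[2 * gid + 1] = b;
}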

Shouldn't simply using vectorized code (e.g. using uint2) result in
just the interleaving we want (on NVIDIA)? I tested this with some of
our formats that can optionally run vectorized, but they don't seem
to gain anything from --force-vector=2.
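
The vectorized equivalent I have in mind is something like this
(again a toy kernel, not an actual format). NVIDIA hardware has no
vector ALUs, so the uint2 operations should scalarize into the same
two independent chains as the manual version above:

__kernel void toy_vec2(__global const uint2 *in, __global uint2 *out)
{
    uint gid = get_global_id(0);
    uint2 v = in[gid];  /* .x and .y are independent data items */

    for (int i = 0; i < 64; i++)
        /* scalar operands are widened to uint2, so this is the
           same work as the two explicit chains in the 2x version */
        v = (v << 5) + v + 0x9e3779b9U;

    out[gid] = v;
}

If that is in fact what the compiler emits, the lack of gain from
--force-vector=2 might just mean register pressure (which the tuning
guide warns about) eats the ILP benefit here.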

magnum
