Date: Wed, 2 Sep 2015 19:35:08 +0300
From: Solar Designer <>
Subject: Re: interleaving on GPUs

On Sun, Aug 23, 2015 at 08:08:06AM +0300, Solar Designer wrote:
> I just read this about NVIDIA's Kepler (such as the old GTX TITAN that
> we have in super):
> "Also note that Kepler GPUs can utilize ILP in place of
> thread/warp-level parallelism (TLP) more readily than Fermi GPUs can.
> Furthermore, some degree of ILP in conjunction with TLP is required by
> Kepler GPUs in order to approach peak single-precision performance,
> since SMX's warp scheduler issues one or two independent instructions
> from each of four warps per clock.  ILP can be increased by means of, for
> example, processing several data items concurrently per thread or
> unrolling loops in the device code, though note that either of these
> approaches may also increase register pressure."
> Note that they explicitly mention "processing several data items
> concurrently per thread".  So it appears that when targeting Kepler, up
> to 2x interleaving at OpenCL kernel source level could make sense.
> On Maxwell (such as the newer GTX Titan X), interleaving shouldn't be
> needed anymore:
> "The power-of-two number of CUDA Cores per partition simplifies
> scheduling, as each of SMM's warp schedulers issue to a dedicated set of
> CUDA Cores equal to the warp width.  Each warp scheduler still has the
> flexibility to dual-issue (such as issuing a math operation to a CUDA
> Core in the same cycle as a memory operation to a load/store unit), but
> single-issue is now sufficient to fully utilize all CUDA Cores."
> I interpret the above as meaning that 2x interleaving can still benefit
> Maxwell at low occupancy, but we can simply increase the occupancy and
> achieve the same effect instead.

Also relevant, about GCN:

"We have two ways of latency hiding:

* By issuing multiple ALU operations on different registers before
waiting for load of specific value into given register.  Waiting for
results of a texture fetch obviously increases the register count, as
increases the lifetime of a register.
* By issuing multiple wavefronts on a CU - while one wave is stalled on
s_waitcnt, other waves can do both vector and scalar ALU.  For this one
we need multiple waves active on a CU."

This blog post doesn't exactly recommend interleaving (where memory
permits, we could just as well issue more wavefronts instead), but it
does recommend having some instruction-level parallelism.
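
To make "interleaving" concrete, here is a minimal sketch in plain C of
what 2x interleaving at kernel source level would look like.  This is
not code from our tree: mix_step(), mix_1x() and mix_2x() are
hypothetical names, and the round function is an arbitrary stand-in
for a dependent chain of ALU operations.  An OpenCL kernel body would
have the same shape.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical stand-in for one round of a hash-like computation.
 * Each operation depends on the previous one, so a single chain
 * offers no instruction-level parallelism on its own. */
static inline uint32_t mix_step(uint32_t x, uint32_t k)
{
    x += k;
    x ^= x >> 15;
    x *= 0x2c1b3c6dU;
    x ^= x >> 12;
    return x;
}

/* 1x: one data item per work-item.  Every step waits on the result
 * of the step before it. */
uint32_t mix_1x(uint32_t x, const uint32_t *keys, size_t rounds)
{
    for (size_t i = 0; i < rounds; i++)
        x = mix_step(x, keys[i]);
    return x;
}

/* 2x interleaved: two independent data items per work-item.  The x
 * and y chains never read each other's results, so the compiler and
 * hardware are free to overlap them.  The cost is roughly twice the
 * live state, which is the register pressure the quoted text warns
 * about. */
void mix_2x(uint32_t *a, uint32_t *b, const uint32_t *keys, size_t rounds)
{
    uint32_t x = *a, y = *b;
    for (size_t i = 0; i < rounds; i++) {
        uint32_t k = keys[i];
        x = mix_step(x, k);
        y = mix_step(y, k);
    }
    *a = x;
    *b = y;
}
```

The alternative discussed above is to keep the 1x form and run twice
as many work-items, which gets the parallelism from occupancy rather
than from ILP.  Which one wins on a given GPU would have to be
measured.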

