Date: Thu, 22 Mar 2012 11:10:15 +0200
From: Milen Rangelov <>
Subject: Re: CUDA & OpenCL status

Hello Alexander,

> Milen - any additional info on that "fused SHL+ADD instruction" on
> Nvidia and its use for MD5 and the like?  I don't immediately see how
> such an instruction would be usable there because we actually need
> rotate+ADD.

The problem is that NVidia does not have a rotate instruction. AMD has
BITALIGN_INT, which does it. With NVidia, rotate() would actually do
(a<<s)|(a>>(32-s)). With Fermi you can have the SHL+ADD thing (which I guess
is in fact just a 32-bit MAD), but rotate() still does not do the trick.
What I do is something like:

#define ROTATE(a,s) (((a)<<(s))+((a)>>(32-(s))))

This makes the compiler emit the needed instructions. Of course that's not
as good as having a single "rotate" instruction, but doing 2 bitwise ops is
still better than doing 3. This works only on sm_2x architectures, so it
really does not matter on, say, a 9800GT.
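
To show why "+" can stand in for "|" in the rotate, here is a minimal host-side C sketch (not GPU code). The helper names rotl_or, rotl_add and check_rotl_equivalence are made up for illustration. The two shifted halves never share a set bit, so the add produces no carries and gives the same result as the OR. This assumes 1 <= s <= 31.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate left written with OR: the form rotate() is said to expand to. */
static uint32_t rotl_or(uint32_t a, unsigned s)
{
    return (a << s) | (a >> (32 - s));
}

/* Same rotate written with ADD. The low s bits of (a << s) are zero and
 * (a >> (32 - s)) only occupies those low s bits, so no carry can occur. */
static uint32_t rotl_add(uint32_t a, unsigned s)
{
    return (a << s) + (a >> (32 - s));
}

/* Hypothetical self-check: both forms agree for every valid shift count.
 * s = 0 and s = 32 are excluded because a shift by 32 is undefined in C. */
static void check_rotl_equivalence(void)
{
    const uint32_t x = 0x67452301u;
    for (unsigned s = 1; s < 32; s++)
        assert(rotl_or(x, s) == rotl_add(x, s));
}
```

Whether a given compiler actually fuses the shift and the add into one instruction depends on the target and toolchain. The sketch only shows that the two forms are arithmetically equivalent.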

Just a side note: you don't need to explicitly use amd_bitalign from the
cl_amd_media_ops extension, as rotate() has mapped to bitalign since SDK 2.3
or even before that. Since recently (Catalyst 12.2), bitselect() is mapped to
BFI_INT too, so there is no need for binary patching to get BFI working.
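
As an illustration of where bitselect() fits, here is a small host-side C sketch that emulates the OpenCL bitselect(a, b, c) semantics (each result bit comes from b where c has a 1, otherwise from a) and uses it for the MD5 F function. Picking MD5 F as the example is my assumption, and the helper names bitselect32, md5_f_plain and md5_f_select are hypothetical. In an actual OpenCL kernel you would call the built-in bitselect() directly.

```c
#include <stdint.h>

/* Host-side emulation of OpenCL bitselect(a, b, c):
 * take the bit from b where c is 1, from a where c is 0. */
static uint32_t bitselect32(uint32_t a, uint32_t b, uint32_t c)
{
    return (a & ~c) | (b & c);
}

/* MD5 round function F in its textbook form: 3 logical ops plus a NOT. */
static uint32_t md5_f_plain(uint32_t x, uint32_t y, uint32_t z)
{
    return (x & y) | (~x & z);
}

/* The same function as a single select: x picks between y and z. */
static uint32_t md5_f_select(uint32_t x, uint32_t y, uint32_t z)
{
    return bitselect32(z, y, x);
}
```

On hardware where the select maps to a single instruction, F then costs one op instead of several.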


