Date: Tue, 22 Oct 2013 14:10:06 +0200
From: magnum <>
Subject: Re: OpenCL vectorizing how-to.

On 2013-10-22 04:08, Lukas Odzioba wrote:
> 2013/10/21 magnum <>:
>> One thing I can't understand is why pre-vectorized code with the correct
>> width is not used "as-is" by these compilers. Apparently the compiler first
>> scalarizes it and then re-vectorizes it - with very poor results, at least
>> on Well. OTOH this isn't a problem now that we can supply the requested
>> [lack of] width.
> "(...)We're likely to generate better code(...)" :)
> I guess better is not necessarily faster :)

Thanks, that was interesting. He did not fully answer my question 
though. If I supply a kernel with the native width of the device, they 
could compile it to e.g. AVX2 or AVX-512 instructions right away, with no 
added execution masks or other overhead. In some cases even the key 
buffer as supplied from host code can be vectorized (and we could do the 
same with the output buffer) - but if they ask me for scalar code, they 
will obviously get a scalar buffer, so the end result will be 
unnecessarily complicated vectorized code dealing with that. This 
particular case does not currently affect any inner loop, though.
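To make the contrast concrete, here is a minimal sketch (not from the original thread; kernel names and the mixing step are hypothetical) of the same transform written scalar versus pre-vectorized with uint4, the way a kernel built for a device whose preferred vector width is 4 might look. The point is that a CPU compiler could map the uint4 version's lanes straight onto SSE/AVX registers instead of scalarizing it and then re-vectorizing:

```c
/* Hypothetical illustration only - the real format kernels differ.
 * Scalar version: one key per work-item. */
__kernel void transform_scalar(__global const uint *in, __global uint *out)
{
    uint gid = get_global_id(0);
    uint x = in[gid];
    x = (x << 3) ^ (x >> 2);   /* placeholder mixing step */
    out[gid] = x;
}

/* Pre-vectorized version: four keys per work-item, with the host laying
 * out both buffers as uint4 so no repacking is needed on either side. */
__kernel void transform_vec4(__global const uint4 *in, __global uint4 *out)
{
    uint gid = get_global_id(0);
    uint4 x = in[gid];
    x = (x << (uint4)3) ^ (x >> (uint4)2); /* same step, all four lanes at once */
    out[gid] = x;
}
```

The host side would pick between these after querying something like CL_DEVICE_PREFERRED_VECTOR_WIDTH_INT via clGetDeviceInfo(); when that returns 1 (scalar requested), the buffers stay scalar too, which is the awkward case described above.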

I think auto-vectorizing is a really great thing, but I can't see why 
they refuse to use pre-vectorized code as supplied. Apparently the 
assumption is that no one vectorizes (or should vectorize) for 
performance, only for the problem domain, as he puts it at ~15:05. Maybe 
in the long run they are right. Or maybe future versions of their 
optimizers will be able to analyze pre-vectorized code and decide 
whether it could be used without "re-vectorizing".

