Date: Fri, 14 Sep 2012 20:10:53 +0200 From: Simon Marechal <simon@...quise.net> To: john-dev@...ts.openwall.com Subject: Re: intrinsics: speed up for linux-x86-64-native On 09/14/2012 07:21 PM, magnum wrote: > It's not very intuitive but AFAIK it was made that way on purpose: The intention is to get code that hides latency, much like GPU coding. This is pretty compiler-dependant and that is why icc can make such a great difference. > > That is about all I know so I'll leave the details to the experts. I hope someone can give a more thorough explanation - I'd read it with interest. This is exactly the right explanation :) I haven't read an Intel optimization guide since the pentium, but AFAIK this should mainly hide instruction dependency latency. Processors are designed around an instruction pipeline, where instructions are being decoded and preprocessed during this pipeline. Every time you increase the pipeline length, you can reduce the complexity of each step, and thus increase the frequency. The problem is that there are several kind of events that lead to instructions not being able to enter the pipeline, or to be stalled, reducing the instructions/cycles ratio. This means that with long pipelines you get huge penalties. This is the reason why the Netburst architecture (Pentium 4) was under-performing, even though it was designed for raw clock speed. It had a huge pipeline and bad reordering and prediction facilities. For example, with conditional branching a single branch is being fed to the pipeline. If the prediction is wrong, the pipeline must be emptied and refilled again. This is the reason why so much effort is spent in CPU and compiler design to predict properly the execution flow. Another case, which is specifically addressed with this optimization, is instruction dependency. If an instruction uses the result of another, it is stalled in the pipeline until the later is finally executed. So we are replacing : b <- a c <- b (full penalty between those two steps) ... by b1 <- a1 b2 <- a2 b3 <- a3 c1 <- b1 c2 <- b2 c3 <- b3 (penalty is reduced by 2 pipeline steps here) Modern processors can reorder instructions dynamically to reduce the impact of such dependencies. This is a bit more complicated, as our instructions work with registers (memory local to a core that is very fast and very scarce), and data exchange between registers and other memory must be reduced. This means that if we overdo this, we won't have enough registers to hold all intermediate values, start using slow memory, and kill performance. ICC is really good at scheduling register usage, which explains the huge speed increase. In GPU what you want to hide is mostly memory latency. You do that by running a lot more programs than you have processing units available. Each processing unit has many times more registers than a CPU, for this purpose. When a memory access is performed, another thread takes over while the memory is transfered. I believe that optimizing GPU workloads is a LOT of fun. I hope this roughly explains what is going on, and that it is not too wrong.
Powered by blists - more mailing lists
Powered by Openwall GNU/*/Linux - Powered by OpenVZ