Date: Tue, 22 Nov 2011 03:48:45 +0400
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Subject: Re: OMP vs SSE

On Tue, Nov 22, 2011 at 12:15:45AM +0100, magnum wrote:
> 2011-11-19 18:43, Solar Designer wrote:
> > That said, latest -jumbo includes OpenMP parallelization for these
> > hashes, which you may make use of.  This gives me 215k c/s on an
> > 8-core (dual Xeon E5420 2.5 GHz).
>
> What if that had been a single E5520 instead (with "4 cores, 8
> threads"). Can that one run 8xSSE just like that?

It would need to run 8 threads for optimal performance (unless HT is
disabled), but it would not reach the speed of a real 8-core machine,
indeed.  It might be just slightly faster than quad-core without HT.

And if you choose to run only 4 threads, but have HT enabled (so have 8
logical CPUs), you might get a performance slightly worse than that of a
quad-core (or of the same CPU with HT disabled), because the kernel
might not always assign your threads to logical CPUs optimally (that is,
to different physical cores), although it normally tries to (in modern
operating systems, which are SMT-aware).

> I have always assumed
> (for no particular reason) that such a CPU would only have 4 SSE
> "engines" but I'm not sure what to google to find out for sure. Maybe it
> depends on CPU model? Or maybe this question doesn't even make sense?

The number of execution units (not only SSE, but in general) is one
thing, and the number of logical CPUs per core is another.

As you're aware, even with one thread we're often trying to process
several data streams in parallel to maximize instruction-level
parallelism and hide latencies.  This is generally limited by the number
of registers and by L1 instruction cache size.
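
To illustrate what processing several data streams within one thread can
look like, here is a minimal C sketch.  It is not code from John the
Ripper: toy_round(), toy_single(), toy_interleaved() and TOY_STREAMS are
hypothetical names, and the round function is a simple xorshift step
chosen only because each step depends on the previous one.  Whether the
interleaved form actually runs faster depends on the compiler and CPU.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a hash round: a xorshift step in which each
 * operation depends on the result of the one before it. */
static inline uint32_t toy_round(uint32_t x)
{
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return x;
}

/* One data stream: every iteration must wait for the previous result,
 * so execution units may sit idle during instruction latencies. */
uint32_t toy_single(uint32_t seed, size_t rounds)
{
    uint32_t x = seed;
    for (size_t i = 0; i < rounds; i++)
        x = toy_round(x);
    return x;
}

#define TOY_STREAMS 4

/* Four independent streams in the same loop body.  The four chains do
 * not depend on each other, so the CPU is free to overlap them.  Each
 * stream needs its own register(s), which is one reason the number of
 * streams cannot be raised indefinitely. */
void toy_interleaved(const uint32_t seed[TOY_STREAMS],
                     uint32_t out[TOY_STREAMS], size_t rounds)
{
    uint32_t a = seed[0], b = seed[1], c = seed[2], d = seed[3];
    for (size_t i = 0; i < rounds; i++) {
        a = toy_round(a);
        b = toy_round(b);
        c = toy_round(c);
        d = toy_round(d);
    }
    out[0] = a;
    out[1] = b;
    out[2] = c;
    out[3] = d;
}
```

Both functions compute the same per-stream values; only the order in
which the work is presented to the CPU differs.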

By also running more than one thread per core in a CPU supporting that
in hardware, we make use of the extra register file(s) to keep our
execution units busy even when they would have been stalled on a data
dependency or memory access in one thread.  This makes sense even when
we have just one execution unit, as long as some instructions would
potentially get stalled (such as waiting for the result of the previous
instruction, if it's not yet available on the very next clock cycle).
This also explains why suboptimal code benefits from SMT more.

> If there is indeed such a limitation in HT, a couple more questions:
>
> 1. What happens if you overbook SSE? Really bad performance?

Nothing bad happens; the combined performance of the threads running on
one core is simply not much higher than that of a single thread.  In
other words, while the individual threads normally become almost twice
as slow each, the combined performance is nevertheless usually slightly
better than that of a single thread running alone on a CPU core.

> 2. Is there a way to easily "detect" HT in our OMP init code, so we can
> limit at the number of cores instead

There's no need.  This would hurt performance.

> (I assume default OMP_NUM_THREADS will be 8 for an E5520)?

Yes, and that's as desired.  There are some cases where lowering the
number of threads is beneficial, such as when the algorithm doesn't
scale well enough or the system is under other load, but SMT by itself
is not one of them.

Alexander