john-users - Re: OMP vs SSE

Message-ID: <20111121234845.GA28591@openwall.com>
Date: Tue, 22 Nov 2011 03:48:45 +0400
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Subject: Re: OMP vs SSE

On Tue, Nov 22, 2011 at 12:15:45AM +0100, magnum wrote:
> 2011-11-19 18:43, Solar Designer wrote:
> > That said, latest -jumbo includes OpenMP parallelization for these
> > hashes, which you may make use of.  This gives me 215k c/s on an
> > 8-core (dual Xeon E5420 2.5 GHz).
> 
> What if that had been a single E5520 instead (with "4 cores, 8
> threads"). Can that one run 8xSSE just like that?

It would need to run 8 threads for optimal performance (unless HT is
disabled), but it would not reach the speed of a real 8-core machine,
indeed.  It might be just slightly faster than quad-core without HT.

And if you choose to run only 4 threads, but have HT enabled (so have 8
logical CPUs), you might get a performance slightly worse than that of a
quad-core (or of the same CPU with HT disabled), because the kernel
might not always assign your threads to logical CPUs optimally (that is,
to different physical cores), although it normally tries to (in modern
operating systems, which are SMT-aware).
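
As an illustration of "different physical cores", here is a minimal Python
sketch that picks one logical CPU per physical core from a topology map.
The function name and the sample layout are hypothetical; on Linux such a
map could be built from /sys/devices/system/cpu/cpu*/topology/, but the
actual numbering varies between systems.

```python
def one_cpu_per_core(core_of):
    """Pick the lowest-numbered logical CPU from each physical core.

    core_of maps logical CPU number -> physical core id (hypothetical
    input, not read from the running system here).
    """
    chosen = {}
    for cpu in sorted(core_of):
        # Keep only the first logical CPU seen for each physical core.
        chosen.setdefault(core_of[cpu], cpu)
    return sorted(chosen.values())


# Assumed 4-core, 8-thread layout where CPUs n and n+4 share a core.
layout = {0: 0, 1: 1, 2: 2, 3: 3, 4: 0, 5: 1, 6: 2, 7: 3}
print(one_cpu_per_core(layout))
```

On Linux, the resulting list could be passed to os.sched_setaffinity(0, cpus)
before starting a 4-thread run, so that the threads cannot end up sharing a
core. With HT enabled and 8 threads running, no such pinning is needed.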

> I have always assumed
> (for no particular reason) that such a CPU would only have 4 SSE
> "engines" but I'm not sure what to google to find out for sure. Maybe it
> depends on CPU model? Or maybe this question doesn't even make sense?

The number of execution units (not only SSE, but in general) is one
thing, and the number of logical CPUs per core is another.

As you're aware, even with one thread we're often trying to process
several data streams in parallel to maximize instruction-level
parallelism and hide latencies.  This is generally limited by the number
of registers and by L1 instruction cache size.  By also running more
than one thread per core in a CPU supporting that in hardware, we make
use of the extra register file(s) to keep our execution units busy even
when they would have been stalled on a data dependency or memory access
in one thread.  This makes sense even when we have just one execution
unit, as long as some instructions would potentially get stalled (such
as waiting for the result of the previous instruction, if it's not yet
available on the very next clock cycle).
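
The latency-hiding effect can be sketched with a toy model in Python.
This is only an illustration under simplifying assumptions (one execution
unit, a fixed result latency, each stream being a pure dependency chain);
it does not model any real CPU, and all names are made up for the example.
The streams may be thought of either as interleaved data streams within
one thread or as separate hardware threads.

```python
def cycles_needed(streams, instrs_per_stream, latency):
    """Cycles for one execution unit to finish all streams.

    Each stream is a chain of dependent instructions: an instruction may
    issue only once the previous one in the same stream has produced its
    result, `latency` cycles after it issued.  The unit issues at most
    one instruction per cycle.
    """
    ready_at = [0] * streams                  # earliest issue cycle per stream
    remaining = [instrs_per_stream] * streams
    cycle = 0
    finish = 0
    while any(remaining):
        ready = [s for s in range(streams)
                 if remaining[s] and ready_at[s] <= cycle]
        if ready:
            # Issue from the stream that has been waiting the longest.
            s = min(ready, key=lambda i: ready_at[i])
            remaining[s] -= 1
            ready_at[s] = cycle + latency
            finish = max(finish, ready_at[s])
        cycle += 1                            # idle cycle if nothing was ready
    return finish


for n in (1, 2, 3):
    print(n, "stream(s):", cycles_needed(n, 100, 3), "cycles")
```

With a latency of 3, one stream needs 300 cycles for its 100 instructions,
while three streams finish 300 instructions in 302 cycles: the same single
unit does three times the work because it is no longer stalled.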

This also explains why suboptimal code benefits more from SMT.

> If there is indeed such a limitation in HT, a couple more questions:
> 
> 1. What happens if you overbook SSE? Really bad performance?

Nothing bad happens; the combined performance of the threads running on
one core is simply not much higher than that of a single thread.  In
other words, while each individual thread normally becomes almost twice
as slow, the combined performance is nevertheless usually slightly
better than that of a single thread running alone on a CPU core.
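
To make the arithmetic concrete, here is a small Python sketch.  The c/s
figures in it are hypothetical and chosen only for illustration; they are
not measurements, and the function name is made up.

```python
def smt_summary(single_cps, combined_cps, threads=2):
    """Return (per-thread slowdown, combined gain) for threads sharing a core.

    single_cps   -- speed of one thread running alone on the core
    combined_cps -- total speed of `threads` threads sharing that core
    """
    per_thread_cps = combined_cps / threads
    slowdown = single_cps / per_thread_cps    # how much slower each thread is
    gain = combined_cps / single_cps          # overall benefit from SMT
    return slowdown, gain


# Hypothetical figures: 1000 c/s alone, 1100 c/s for two threads combined.
slowdown, gain = smt_summary(1000, 1100)
print(f"each thread is {slowdown:.2f}x slower, combined speed is {gain:.2f}x")
```

With these assumed figures each thread runs at 550 c/s, which is almost
twice as slow as running alone, yet the core as a whole is 10% faster.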

> 2. Is there a way to easily "detect" HT in our OMP init code, so we can
> limit at the number of cores instead

There's no need.  This would hurt performance.

> (I assume default OMP_NUM_THREADS will be 8 for an E5520)?

Yes, and that's as desired.
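
The thread count selection can be sketched in Python as follows.  This is
a simplified model, not the actual logic of any OpenMP runtime: it assumes
the runtime defaults to the number of logical CPUs when OMP_NUM_THREADS is
unset (the default is implementation-defined), and it ignores the
comma-separated list form of the variable.  The function name is
hypothetical.

```python
import os


def effective_threads(environ, logical_cpus):
    """Thread count a run would use under the assumptions stated above."""
    value = environ.get("OMP_NUM_THREADS", "").strip()
    if value.isdigit() and int(value) > 0:
        return int(value)        # explicit override by the user
    return logical_cpus          # assumed default: one thread per logical CPU


# On a "4 cores, 8 threads" CPU with HT enabled, the OS reports 8 logical CPUs.
print(effective_threads({}, 8))
print(effective_threads({"OMP_NUM_THREADS": "4"}, 8))
print(effective_threads(os.environ, os.cpu_count() or 1))
```

Leaving the variable unset on such a machine gives 8 threads, which is the
desired setting here; the override remains available for the other cases.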

There are some cases where lowering the number of threads is beneficial,
such as when the algorithm doesn't scale well enough or the system is
under other load, but SMT by itself is not one of them.

Alexander