Date: Mon, 19 Oct 2015 16:51:02 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: SMT (was: SHA-1 H())

Lei,

I just came across a recently posted article on this very topic:
performance scaling with POWER8's SMT (albeit in the context of the
difference in CPU utilization reporting on AIX vs. Linux):

http://www.ibm.com/developerworks/library/l-processor-utilization-difference-aix-lop-trs/index.html

"Simultaneous multithreading (SMT) performance characterization shown in
Figure 6 is taken from the IBM POWER8 specification.  This figure shows
that SMT8 provides 2.2 times better performance compared to single
threaded on POWER8."

The article also mentions that "a single-threaded application" run "on
an IBM POWER7 SMT4 system" "shows the core utilization as approximately
63% to 65%".

So the expected speedup when going from 1 thread/core to 8 threads/core
on POWER8 is 2.2 times, and the expected speedup when going from 1
thread/core to 4 threads/core on POWER7 is 1.5 to 1.6 times (the
reciprocal of the 63% to 65% utilization figure).  Of course, actual
speedup will vary by application.
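
To make the arithmetic behind these numbers explicit, here is a rough
sketch.  It assumes that the utilization reported for one busy thread is
simply the reciprocal of the expected full-SMT speedup, which is how I
read the article; the function names are made up for illustration and
are not from the article or from any tool.

```python
def smt_speedup_from_utilization(single_thread_utilization):
    """Expected full-SMT speedup, assuming one busy thread is reported
    as using this fraction of the core (an assumption, see above)."""
    if not 0.0 < single_thread_utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    return 1.0 / single_thread_utilization


def per_thread_throughput(total_speedup, threads_per_core):
    """Throughput of each thread relative to one thread running alone
    on the core, assuming the speedup is shared evenly."""
    if threads_per_core < 1:
        raise ValueError("threads_per_core must be at least 1")
    return total_speedup / threads_per_core


# POWER7 SMT4: 63% to 65% utilization for one thread (from the article)
power7_low = smt_speedup_from_utilization(0.65)
power7_high = smt_speedup_from_utilization(0.63)

# POWER8 SMT8: 2.2x total (from the article), spread over 8 threads
power8_per_thread = per_thread_throughput(2.2, 8)

print("POWER7 SMT4 speedup: %.2fx to %.2fx" % (power7_low, power7_high))
print("POWER8 SMT8 per-thread throughput: %.3f" % power8_per_thread)
```

The last figure says that at 8 threads/core each thread would run at
roughly a quarter of the speed of a lone thread on that core, if the
2.2 times figure holds for the workload in question.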

Alexander

P.S. I don't normally top-post, but it's one of those rare cases where I
find this appropriate - needing to quote a lot of context, yet not
wanting to keep it above the new content.  So here goes:

On Sat, Sep 12, 2015 at 12:57:45PM +0300, Solar Designer wrote:
> On Sat, Sep 12, 2015 at 04:53:42PM +0800, Lei Zhang wrote:
> > On my laptop, where each core supports 2 hardware threads, running 2 threads gets a 2x speedup compared to 1 thread on the same core.
> 
> This happens, but it's not very common.  Usually, speedup from running 2
> threads/core is much less than 2x.
> 
> > OTOH, each Power8 core supports up to 8 hardware threads, so I'd expect a higher speedup than just 2x.
> 
> SMT isn't only a way to increase resource utilization of a core when
> running many threads.  It's also a way to achieve lower latency due to
> fewer context switches in server workloads (with lots of concurrent
> requests) and to allow CPU designers to use higher instruction latencies
> and achieve higher clock rate.  (Note that my two uses of the word
> latency in the previous sentence refer to totally different latencies:
> server response latency on the order of milliseconds may be improved,
> but instruction latency on the order of nanoseconds may be harmed at the
> same time.)  Our workload uses relatively low latency instructions:
> integer only, and with nearly 100% L1 cache hit rate.  Some other
> workloads like multiplication of large matrices (exceeding L1 data
> cache) might benefit from more hardware threads per core (or explicit
> interleaving, but that's uncommon in scientific workloads except through
> OpenCL and such), and that's also a reason for Power CPU designers to
> support and possibly optimize for more hardware threads per core.
> 
> Finally, SMT provides middle ground between increasing the number of
> ISA-visible CPU registers (which is limited by instruction size and the
> number of register operands you can encode per instruction, as well as
> by the need to maintain compatibility) and increasing the number of
> rename registers.  With SMT, there are sort of more ISA-visible CPU
> registers: total across the many hardware threads.  Those registers are
> as good as ISA-visible ones for the purpose of replacing the need to
> interleave instructions within 1 thread, yet they don't bump into
> instruction size limitations.
> 
> I expect that on a CPU with more than 2 hardware threads the speed
> growth with the increase of threads/core in use is spread over the 1 to
> max threads range.  So e.g. the speedup at only 2 threads on an 8
> hardware threads CPU may very well be less than the speedup at 2 threads
> on a 2 hardware threads CPU.  I don't necessarily expect that the
> speedup achieved at max threads is much or any greater than that
> achieved at 2 threads on a CPU where 2 is the max.  There's potential
> for it to be greater (in the sense that the thread count doesn't limit
> it to at most 2), but it might or might not be greater in practice.
> 
> Alexander
