Date: Mon, 19 Oct 2015 16:51:02 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: SMT (was: SHA-1 H())
Lei,
I just came across a recently posted article on this very topic:
performance scaling with POWER8's SMT (albeit in the context of
differing CPU utilization reporting on AIX vs. Linux):
http://www.ibm.com/developerworks/library/l-processor-utilization-difference-aix-lop-trs/index.html
"Simultaneous multithreading (SMT) performance characterization shown in
Figure 6 is taken from the IBM POWER8 specification. This figure shows
that SMT8 provides 2.2 times better performance compared to single
threaded on POWER8."
The article also mentions that "a single-threaded application" run "on
an IBM POWER7 SMT4 system" "shows the core utilization as approximately
63% to 65%".
So the expected speedup when going from 1 thread/core to 8 threads/core
on POWER8 is 2.2 times, and the expected speedup when going from 1
thread/core to 4 threads/core on POWER7 is 1.5 to 1.6 times. Of course,
actual speedup will vary by application.
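The 1.5 to 1.6 times figure for POWER7 follows directly from the
utilization numbers quoted above: if one thread already keeps the core
63% to 65% busy, filling it to 100% bounds the speedup. A quick sanity
check of that arithmetic:

```python
# If a single thread utilizes 63-65% of an SMT4 core, the best case for
# running the core fully loaded is 100% / utilization.
for util in (0.63, 0.65):
    print(f"{util:.0%} single-thread utilization -> "
          f"expected speedup ~{1 / util:.2f}x")
```

This is only an upper-bound estimate from core utilization; as noted
above, the actual speedup will vary by application.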
Alexander
P.S. I don't normally top-post, but it's one of those rare cases where I
find this appropriate - needing to quote a lot of context, yet not
wanting to keep it above the new content. So here goes:
On Sat, Sep 12, 2015 at 12:57:45PM +0300, Solar Designer wrote:
> On Sat, Sep 12, 2015 at 04:53:42PM +0800, Lei Zhang wrote:
> > On my laptop, where each core supports 2 hardware threads, running 2 threads gets a 2x speedup compared to 1 thread on the same core.
>
> This happens, but it's not very common. Usually, speedup from running 2
> threads/core is much less than 2x.
>
> > OTOH, each Power8 core supports up to 8 hardware threads, so I'd expect a higher speedup than just 2x.
>
> SMT isn't only a way to increase resource utilization of a core when
> running many threads. It's also a way to achieve lower latency due to
> fewer context switches in server workloads (with lots of concurrent
> requests) and to allow CPU designers to use higher instruction latencies
> and achieve higher clock rate. (Note that my two uses of the word
> latency in the previous sentence refer to totally different latencies:
> server response latency on the order of milliseconds may be improved,
> but instruction latency on the order of nanoseconds may be harmed at the
> same time.) Our workload uses relatively low-latency instructions:
> integer only, and with nearly 100% L1 cache hit rate. Some other
> workloads like multiplication of large matrices (exceeding L1 data
> cache) might benefit from more hardware threads per core (or explicit
> interleaving, but that's uncommon in scientific workloads except through
> OpenCL and such), and that's also a reason for Power CPU designers to
> support and possibly optimize for more hardware threads per core.
>
> Finally, SMT provides middle ground between increasing the number of
> ISA-visible CPU registers (which is limited by instruction size and the
> number of register operands you can encode per instruction, as well as
> by the need to maintain compatibility) and increasing the number of
> rename registers. With SMT, there are sort of more ISA-visible CPU
> registers: total across the many hardware threads. Those registers are
> as good as ISA-visible ones for the purpose of replacing the need to
> interleave instructions within 1 thread, yet they don't bump into
> instruction size limitations.
>
> I expect that on a CPU with more than 2 hardware threads, the speedup
> from adding threads/core is spread over the whole range from 1 thread
> to the maximum. So e.g. the speedup at only 2 threads on an 8
> hardware threads CPU may very well be less than the speedup at 2 threads
> on a 2 hardware threads CPU. I don't necessarily expect that the
> speedup achieved at max threads is much or any greater than that
> achieved at 2 threads on a CPU where 2 is the max. There's potential
> for it to be greater (in the sense that the thread count doesn't limit
> it to at most 2), but it might or might not be greater in practice.
>
> Alexander
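The per-core scaling discussed in the quoted message is easy to probe
empirically. Below is a minimal, Linux-only sketch: it pins one
integer-only worker process per logical CPU of a single physical core
and compares throughput. The sibling CPU numbers are an assumption
(verify them against your system's topology files), and the spin loop
is merely illustrative of an integer workload with a near-100% L1 hit
rate, not any actual hashing code.

```python
# Sketch of per-core SMT scaling measurement. Linux-only
# (os.sched_setaffinity), and the logical CPU numbers below are
# assumptions about which hardware threads share a core.
import os
import time
from multiprocessing import Process

def spin(n=5_000_000):
    # Integer-only inner loop with no memory traffic to speak of,
    # loosely analogous to a hashing workload.
    x = 0
    for i in range(n):
        x = (x * 31 + i) & 0xFFFFFFFF
    return x

def _pinned_spin(cpu):
    os.sched_setaffinity(0, {cpu})  # pin this worker to one logical CPU
    spin()

def timed_run(cpus):
    # One pinned worker per logical CPU; return wall-clock time for all.
    procs = [Process(target=_pinned_spin, args=(c,)) for c in cpus]
    t0 = time.perf_counter()
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return time.perf_counter() - t0

if __name__ == "__main__":
    # Assumed SMT siblings of core 0; check
    # /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
    siblings = [0, 1]
    t1 = timed_run(siblings[:1])
    tn = timed_run(siblings)
    # n workers complete n units of work in tn seconds, so the per-core
    # throughput speedup over 1 worker is n * t1 / tn.
    print(f"speedup with {len(siblings)} threads/core: "
          f"{len(siblings) * t1 / tn:.2f}x")
```

On a real SMT machine, the printed speedup should land well below the
thread count, consistent with the 1.5x to 2.2x figures discussed above,
and running it at intermediate thread counts would show how the gain is
spread across the 1-to-max range.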