Date: Sat, 12 Sep 2015 12:57:45 +0300
From: Solar Designer
Subject: Re: SHA-1 H()


On Sat, Sep 12, 2015 at 04:53:42PM +0800, Lei Zhang wrote:
> On my laptop, where each core supports 2 hardware threads, running 2 threads gets a 2x speedup compared to 1 thread on the same core.

This happens, but it's not very common.  Usually, speedup from running 2
threads/core is much less than 2x.

> OTOH, each Power8 core supports up to 8 hardware threads, so I'd expect a higher speedup than just 2x.

SMT isn't only a way to increase resource utilization of a core when
running many threads.  It's also a way to achieve lower latency due to
fewer context switches in server workloads (with lots of concurrent
requests) and to allow CPU designers to use higher instruction latencies
and achieve a higher clock rate.  (Note that my two uses of the word
latency in the previous sentence refer to totally different latencies:
server response latency on the order of milliseconds may be improved,
but instruction latency on the order of nanoseconds may be harmed at the
same time.)  Our workload uses relatively low latency instructions:
integer only, and with nearly 100% L1 cache hit rate.  Some other
workloads like multiplication of large matrices (exceeding L1 data
cache) might benefit from more hardware threads per core (or explicit
interleaving, but that's uncommon in scientific workloads except through
OpenCL and such), and that's also a reason for Power CPU designers to
support and possibly optimize for more hardware threads per core.

Finally, SMT provides a middle ground between increasing the number of
ISA-visible CPU registers (which is limited by instruction size and the
number of register operands you can encode per instruction, as well as
by the need to maintain compatibility) and increasing the number of
rename registers.  With SMT, there are in effect more ISA-visible CPU
registers: the total across all hardware threads.  Those registers are
as good as ISA-visible ones for the purpose of replacing the need to
interleave instructions within 1 thread, yet they don't bump into
instruction size limitations.
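
To make "interleave instructions within 1 thread" concrete, here is a
minimal sketch.  The rotate-and-add step is a made-up stand-in, not
SHA-1, and the function names are hypothetical.  Python only shows the
shape of the code here; the performance effect would be seen in compiled
code, where the two chains compete for the same ISA-visible registers.

```python
MASK = 0xFFFFFFFF  # keep values in 32 bits, as a C uint32_t would


def rotl32(x, n):
    """Rotate a 32-bit value left by n bits (0 < n < 32)."""
    return ((x << n) | (x >> (32 - n))) & MASK


def mix_one(state, words):
    """One dependency chain: each step needs the previous step's result,
    so a core cannot start step i+1 until step i has completed."""
    for w in words:
        state = (rotl32(state, 5) + w) & MASK
    return state


def mix_two_interleaved(state_a, words_a, state_b, words_b):
    """Two independent chains advanced in lockstep within one thread.
    The A and B steps do not depend on each other, so a core can overlap
    them, but both states must be held in registers at the same time.
    Assumes words_a and words_b have equal length."""
    for wa, wb in zip(words_a, words_b):
        state_a = (rotl32(state_a, 5) + wa) & MASK
        state_b = (rotl32(state_b, 5) + wb) & MASK
    return state_a, state_b
```

With SMT, the same overlap can come from running mix_one() in two
hardware threads instead, each with its own set of ISA-visible
registers.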

I expect that on a CPU with more than 2 hardware threads per core, the
speedup from using more threads/core is spread over the whole range from
1 to the max thread count.  So e.g. the speedup at only 2 threads/core
on a CPU with 8 hardware threads per core may very well be less than the
speedup at 2 threads/core on a CPU with 2.  I don't necessarily expect
that the
speedup achieved at max threads is much or any greater than that
achieved at 2 threads on a CPU where 2 is the max.  There's potential
for it to be greater (in the sense that the thread count doesn't limit
it to at most 2), but it might or might not be greater in practice.
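
One way to check this on a given machine is to benchmark at each
threads/core setting and look at both the cumulative speedup and the
gain from each step.  A minimal sketch follows; the helper names are
hypothetical, and the input is whatever rates you measure yourself (for
example c/s at 1, 2, 4 and 8 threads on one core).

```python
def speedup_curve(throughput):
    """Cumulative speedup relative to 1 thread/core.

    throughput maps threads/core in use -> measured rate (any unit, as
    long as it is the same for all entries).  Must include an entry for 1.
    """
    base = throughput[1]
    return {t: throughput[t] / base for t in sorted(throughput)}


def marginal_gains(throughput):
    """Gain of each measured thread count over the next lower one.

    If the growth is spread over the whole range, these ratios stay
    modestly above 1.0 at every step rather than being large at 2
    threads/core and flat afterwards.
    """
    counts = sorted(throughput)
    return {b: throughput[b] / throughput[a]
            for a, b in zip(counts, counts[1:])}
```

Comparing speedup_curve() at 2 threads/core between a 2-way and an 8-way
SMT CPU, and at max threads on each, would show which of the cases above
applies.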

