musl - Re: spin strategy in _

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20150830171837.GO7833@brightrain.aerifal.cx>
Date: Sun, 30 Aug 2015 13:18:37 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Re: spin strategy in __wait

On Sun, Aug 30, 2015 at 05:31:39PM +0200, Jens Gustedt wrote:
> Am Samstag, den 29.08.2015, 20:57 -0400 schrieb Rich Felker:
> > On Sat, Aug 29, 2015 at 09:14:17PM +0200, Jens Gustedt wrote:
> > > The question that I asked myself is how much spinning to we need in
> > > such lock/wait primitives. In the first attached graphic, the three
> > > bottom curves show what difference three spinning strategies can make
> > > for this test on a 2x2 hyperthreaded core machine. The bottom is no
> > > spinning at all, the next is the current strategy implemented in musl,
> > > and the third is a modification of that. As you can see, for a high
> > > load of threads they can make a substantial difference, but you also
> > > see that musl's actually strategy is not very different from doing no
> > > spinning at all.
> > 
> > That's intentional when you have waiters. Notice in your graph that
> > for the range 2-4 threads, performance in musl is comparable to your
> > "no-shortcut" approach. This shows that the spinning is working as
> > intended.
> 
> In that range, yes. When the machine is overcommitted with, say, 2 times
> more threads than the number of cores, it is significantly off.

I said "as intended". The intent is that no spinning happen when there
are more threads than cores. You're free to claim this is suboptimal
for raw performance, but it's the intended behavior.

> > On the other hand, are you even counting the number of times each
> > individual thread performs a malloc/free cycle, or only the total
> > across all threads? Is it possible that, among your 256 threads, the
> > vast majority of malloc/free cycles are being performed by a small
> > number and the rest are stuck repeating futex_wait over and over?
> 
> I now also collected these statistics, and, no, all threads get their
> share. I attach to data files (tab separated values) that show
> this. These files have one line per run (10 sec each), with the
> following columns
> 
> number of threads
> average number per thread of calls to malloc/free during the 10 secs
> standard deviation of that quantity
> the min value over all threads
> the max value over all threads
> 
> As you may see, the variation is the biggest around 3-4 threads, but
> stays reasonable, I think, from overall min to overall max there is a
> difference of 50% or so. Nobody is blocked.
> 
> This variation then gets really small for up to 64 threads. For 128
> and 256 it is a bit bigger, but overall the numbers are a bit too
> small for that cases and one should perhaps run the test for more than
> 10 secs.

Interesting.

> > > Now the only situation I thought of where this could be important is
> > > monoprocessors where actually spinning might not be so good and
> > > aborting it early good be necessary. So I rand the same test with
> > > taskset to nail the process to just one core. The result of that are
> > > the top two curves. As you can see, here the spinning strategy has
> > > almost no influence so I think we are safe to apply this patch.
> > 
> > This is very surprising to me, which further makes me doubt the test
> > methodology. The current number of spins is tuned to be comparable in
> > magnitude to the syscall cost, at least on x86.
> 
> On my machine, x86_64, with are quite recent kernel it is not, it is a
> factor of 10 off. As I said, I observe a factor of 10 between a spin
> and a syscall, and not 100 as the existing code suggests.
> 
> And, BTW, I also observed that the call to "a_spin" doesn't have much
> effect on my machine, neither positive nor negative.
> 
> > Adding this extra cost
> > to each contended lock operation (you basically never stop spinning
> > early on single-core) should produce a measurable difference.
> 
> No, on single core you have only one thread active at any time, and
> the probability that it is descheduled in the middle when holding the
> lock is tiny. So basically nobody is ever spinning as long as the time
> that is spend in the critical section is small. (And you may have
> sensible outlayers, once a thread is descheduled in the middle.)

Consider what happens when the thread holding the lock is preempted.

If there is no spinning, the futex_wait syscall should happen
immediately and cause the lock-holder to be scheduled again.

If there is spinning, you have a 10x syscall-time (by your
measurement) additional delay before the lock-holder gets scheduled
again.

I'm failing to see why this does not have a significant effect on
performance. Maybe the kernel just does a good job of optimizing for
not preempting a lock-holder. This optimization would naturally happen
if the kernel chooses to and a task's timeslice early when it makes a
syscall with a trivial amount of time remaining to run after the
syscall would return, which is a good optimization to make because
it's cheaper to switch tasks when you're already in kernelspace than
via a timer interrupt a few microseconds later.

> > > Now all of this can also be read as a performance test of the malloc
> > > subsystem, and here my feeling goes in the direction that Rich
> > > recently indicated. The performance of the "application" is much
> > > better if I eliminate all parallelism. As an additional indication
> > > there are to additional curves that fix the process to one core and
> > > its hyperthreaded sibling.
> > > 
> > > So maybe we might be better off with a simpler malloc strategy that
> > > serializes all requests.
> > 
> > I tried this a few weeks ago and it performed significantly worse on
> > my malloc stress tests. And it will get even worse as we try to scale
> > to larger numbers of cores, or worse yet NUMA...
> 
> Did you try running your malloc stress test with taskset to nail the
> process to one core?

No. I could do that, but I don't think there's any sense in optimizing
for multi-threaded load on single-core at the expense of multi-core
performance.

Rich
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.