musl - Re: spin strategy in _

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <1440970685.693.23.camel@inria.fr>
Date: Sun, 30 Aug 2015 23:38:05 +0200
From: Jens Gustedt <jens.gustedt@...ia.fr>
To: musl@...ts.openwall.com
Subject: Re: spin strategy in __wait

Am Sonntag, den 30.08.2015, 13:18 -0400 schrieb Rich Felker:
> On Sun, Aug 30, 2015 at 05:31:39PM +0200, Jens Gustedt wrote:
> > Am Samstag, den 29.08.2015, 20:57 -0400 schrieb Rich Felker:
> > > On Sat, Aug 29, 2015 at 09:14:17PM +0200, Jens Gustedt wrote:
> > > > Now the only situation I thought of where this could be important is
> > > > monoprocessors where actually spinning might not be so good and
> > > > aborting it early good be necessary. So I rand the same test with
> > > > taskset to nail the process to just one core. The result of that are
> > > > the top two curves. As you can see, here the spinning strategy has
> > > > almost no influence so I think we are safe to apply this patch.
> > > 
> > > This is very surprising to me, which further makes me doubt the test
> > > methodology. The current number of spins is tuned to be comparable in
> > > magnitude to the syscall cost, at least on x86.
> > 
> > On my machine, x86_64, with are quite recent kernel it is not, it is a
> > factor of 10 off. As I said, I observe a factor of 10 between a spin
> > and a syscall, and not 100 as the existing code suggests.
> > 
> > And, BTW, I also observed that the call to "a_spin" doesn't have much
> > effect on my machine, neither positive nor negative.
> > 
> > > Adding this extra cost
> > > to each contended lock operation (you basically never stop spinning
> > > early on single-core) should produce a measurable difference.
> > 
> > No, on single core you have only one thread active at any time, and
> > the probability that it is descheduled in the middle when holding the
> > lock is tiny. So basically nobody is ever spinning as long as the time
> > that is spend in the critical section is small. (And you may have
> > sensible outlayers, once a thread is descheduled in the middle.)
> 
> Consider what happens when the thread holding the lock is preempted.
> 
> If there is no spinning, the futex_wait syscall should happen
> immediately and cause the lock-holder to be scheduled again.
> 
> If there is spinning, you have a 10x syscall-time (by your
> measurement) additional delay before the lock-holder gets scheduled
> again.
> 
> I'm failing to see why this does not have a significant effect on
> performance.

I think that for the average performance this has no effect because
the probability that such an even occurs is really small, the number
of exceptional cases is several orders of magnitude less than the
"normal" case. So you simply can't see it on average.

Second, the work that is done by spinning is still considerably less
in that benchmark than the "real" work that is done inside the
critical section(s).

So statistically, you just can't see it.

With the <stdatomic.h> code I did experiments that randomly preemted
threads in the middle of the critical section. The results show that
everything stays quiet stable even with a substantial share of such
preemtions.

> Maybe the kernel just does a good job of optimizing for
> not preempting a lock-holder. This optimization would naturally happen
> if the kernel chooses to and a task's timeslice early when it makes a
> syscall with a trivial amount of time remaining to run after the
> syscall would return, which is a good optimization to make because
> it's cheaper to switch tasks when you're already in kernelspace than
> via a timer interrupt a few microseconds later.

interesting theory

The only thing that I can deduce so far, is that these things are much
more stable than I would have expected. Linux is a marvelous tool.

> > > > Now all of this can also be read as a performance test of the malloc
> > > > subsystem, and here my feeling goes in the direction that Rich
> > > > recently indicated. The performance of the "application" is much
> > > > better if I eliminate all parallelism. As an additional indication
> > > > there are to additional curves that fix the process to one core and
> > > > its hyperthreaded sibling.
> > > > 
> > > > So maybe we might be better off with a simpler malloc strategy that
> > > > serializes all requests.
> > > 
> > > I tried this a few weeks ago and it performed significantly worse on
> > > my malloc stress tests. And it will get even worse as we try to scale
> > > to larger numbers of cores, or worse yet NUMA...
> > 
> > Did you try running your malloc stress test with taskset to nail the
> > process to one core?
> 
> No. I could do that, but I don't think there's any sense in optimizing
> for multi-threaded load on single-core at the expense of multi-core
> performance.

This was not my idea, and that is not what is going on. On the
single-core the performance *is* already much, much better than on
multi-core. So you did that optimization already :)

In a perfect world, a multi-threaded malloc stress test that is bound
to one core should drop performance, not increase it. If there is no
parallel gain with the malloc implementation one should just be better
off with a serialized malloc.

I'd just like to know what's going on. It would be good to have a
second data point.

Jens

-- 
:: INRIA Nancy Grand Est ::: Camus ::::::: ICube/ICPS :::
:: ::::::::::::::: office Strasbourg : +33 368854536   ::
:: :::::::::::::::::::::: gsm France : +33 651400183   ::
:: ::::::::::::::: gsm international : +49 15737185122 ::
:: http://icube-icps.unistra.fr/index.php/Jens_Gustedt ::




Download attachment "signature.asc" of type "application/pgp-signature" (182 bytes)
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.