Date: Thu, 30 Jul 2015 09:45:19 -0400
From: Rich Felker <>
Subject: Re: New optimized normal-type mutex?

On Thu, Jul 30, 2015 at 10:07:34AM +0200, Jens Gustedt wrote:
> Am Mittwoch, den 29.07.2015, 20:10 -0400 schrieb Rich Felker:
> > On Thu, Jul 30, 2015 at 01:49:20AM +0200, Jens Gustedt wrote:
> > > Hm, could you be more specific about where this hurts?
> > > 
> > > In the code I have there is
> > > 
> > >         while (val & lockbit) {
> > >           __syscall(SYS_futex, loc, FUTEX_WAIT, val, 0);
> > >           val = atomic_load_explicit(loc, memory_order_consume);
> > >         }
> > > 
> > > so this should be robust against spurious wakeups, no?
> > 
> > The problem is that futex_wait returns immediately with EAGAIN if
> > *loc!=val, which happens very often if *loc is incremented or
> > otherwise changed on each arriving waiter.
> Yes, sure, it may change. Whether this happens often is hard to say;
> I don't think we can easily make a quantitative statement here.

The same happened with the obvious implementation of cond vars, having
a waiter store its tid then wait on *val==tid with futex_wait. The
ratio of spurious wakes to intended wakes was something like 10:1. So
this is not a purely theoretical problem.

> In the case of atomics the critical section is extremely short, and
> the count, if it varies that much, should have stabilized somewhat
> during the spinlock phase before we reach the futex part. The futex
> part is really only a last resort for the rare case that the thread
> holding the lock has been descheduled in the middle.

Spinning is near-useless whenever you have more contenders than cores.
It's an optimization for a special case (fewer contenders than cores)
but you don't want to rely on it.

> My current test case has X threads hammering on one single location,
> with X up to several hundred. On my 2x2 hyperthreaded CPU, for a
> reasonable number of threads (X = 16 or 32), I see an overall
> performance improvement of roughly 30% when using my version of the
> lock instead of the original musl one. The crossover point where the
> original musl lock becomes better is at about 200 threads.

Interesting. I suspect this would change quickly if you increased the
amount of work done with the lock held (e.g. to more closely mimic
malloc). The nasty thing is that, as soon as you _do_ cause futex_wait
to start happening rather than just spinning, performance collapses:
each thread that needs to wait calls futex_wait several times before
it succeeds, taking up CPU time that the thread holding the lock could
be using to run.

