Date: Tue, 26 Aug 2014 17:36:25 -0400
From: Rich Felker <dalias@...c.org>
To: Jens Gustedt <jens.gustedt@...ia.fr>
Cc: musl@...ts.openwall.com
Subject: Re: Multi-threaded performance progress

On Tue, Aug 26, 2014 at 11:15:56PM +0200, Jens Gustedt wrote:
> > > ok, it shouldn't be difficult to use atomic ops, then.
> > 
> > Based on what you've said below, though, I think there's still a big
> > problem..
> 
> I am not sure that I follow.

OK, I think I misunderstood things you said before.

> > > > Does your code eliminate that all the time? If so it's more
> > > > interesting. I'll look at it more.
> > > 
> > > Yes, it will be the signaling or broadcasting thread that will be
> > > working on the integrity of the list while it is holding the lock. At
> > > the end those that it detected to be leaving will be isolated list
> > > items, those that are to be signaled form one consecutive list that is
> > > detached from the cv, and the ones that remain (in the signaling case)
> > > form a valid cv-with-list.
> > > 
> > > The only small gap that remains (and that annoys me) is the leaving
> > > thread that sneaks in
> > > 
> > >  - marks itself as leaving before the end of the CS
> > >  - only asks for _c_lock *after* the signaling thread has left its CS
> > > 
> > > This is all our problem of late access to the cv variable revisited,
> > > but here it is condensed in a very narrow time frame. Both threads
> > > must be active for this to happen, so my hope is that when both are
> > > spinning for some time (the waiter on the _c_lock and the signaler
> > > on ref), neither of them will "ever" be forced into a futex wait.
> > 
> > That's a bug that needs to be fixed to go forward with this, since
> > it's essentially a use-after-free. Now that you mention it, avoiding
> > use-after-free was one of my main motivations for having such waiters
> > synchronize with the signaling thread. That should have been
> > documented in a comment somewhere, but the point seems to have slipped
> > my mind sometime between the design phase and writing the code and
> > comments.
> 
> here I don't follow either.
> 
> What I describe is an inconvenience that forces all these mechanics,
> with the signaller still waiting for the ref count to fall to 0 before
> it can leave and abandon the cv. As far as I can see, that should still
> work with both versions.

OK, I misinterpreted what you were saying as a statement that you had
somehow optimized out this waiting, while in reality you seem to have
been saying that there's still this window where the wait is required.
Is this correct?
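
To be concrete, the wait I mean is the one at the end of the
signal/broadcast path, roughly this pattern (sketch only, in terms of
the existing primitives, not a verbatim excerpt):

    /* signaling/broadcasting thread: ref counts waiters that may
     * still access the cv (or this thread's stack); we must not
     * return, and let the cv be destroyed or reused, until it has
     * dropped to zero */
    int cur;
    while ((cur = ref)) __wait(&ref, 0, cur, 1);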

> > I'm not seeing any solution to that problem.
> 
> You mean in the code that I proposed? No, there is no other solution
> than what you have in your original version: have the signalling
> thread wait for all the leaving waiters to come in. In your version
> the write to node->notify is protected by _c_lock, so this is not
> subject to races.
> 
> In my version this is the same, unless the lockRace variant gives the
> leaving thread some additional information. If this information about
> ref arrives in time, before the _c_lock is taken, it is used, and the
> leaving thread never touches the cv again. If it doesn't arrive in
> time, the _c_lock is waited for and then taken, the notifier is
> re-evaluated, and all is fine. (__wait should be a memory barrier,
> shouldn't it?)

OK, it sounds plausible that this works. But it is just moving (and
hopefully shortening) the race window, not eliminating it.
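
Just so we're talking about the same thing, the leaving-waiter path I
understand you to be describing is roughly this (pure sketch with a
hypothetical flag name; correct me if I'm misreading the patch):

    /* leaving waiter */
    if (ref_hint_arrived_in_time) {
        /* the signaler has already accounted for us; from here on
         * we must never touch the cv again */
    } else {
        lock(&c->_c_lock); /* possibly spin, then futex wait */
        /* re-evaluate the notify target under the lock, as the
         * current code does, then proceed as before */
        unlock(&c->_c_lock);
    }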

> > I'm also still skeptical that there's a problem to be solved here; for
> > it to matter, the incidence of such races needs to be pretty high, I
> > think.
> 
> "matter" can mean different things. I don't think that it my version
> penalizes performance (but sure a proof would be better), it could
> also give some small benefits concerning execution time, it could be
> mildly better by avoiding cache-ping-pong and by avoiding some futex
> calls.
> 
> For me, it matters because the handling is "simpler" (a proof would be
> much, much better here :) in that there is one single thread
> manipulating the list. So I should work more on proving that point by
> keeping it much closer to your original version.

As for whether it's simpler, I'm still not convinced, but let's see.

BTW I just found one bug:

+               if (a_fetch_add(&ref[0], -1)==1 && ref[1])
+                       __wake(&ref[0], 1, 1);

This accesses the signaling thread's stack (ref[1]) after releasing
the signaling thread to return.

Fixing it should be trivial via the design I mentioned earlier: don't
use a waiter flag like this, but instead offset the initial value of
ref by +1 and a_dec it just before waiting. As in other places, of
course, a wake to an invalid address is possible either way; this is
"fixable" if necessary via FUTEX_WAKE_OP (having the kernel do the
atomic dec after acquiring the futex bin locks).
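
Concretely, the shape I have in mind is something like this (just a
sketch of the idea with made-up variable names, not tested code):

    /* signaling/broadcasting thread: ref lives on its stack, biased
     * by +1 so it cannot reach zero before the signaler has committed
     * to waiting */
    volatile int ref = nwaiters + 1; /* nwaiters: threads being woken */
    /* ... detach the waiters and publish &ref to them ... */
    a_dec(&ref); /* drop the bias just before waiting */
    int cur;
    while ((cur = ref)) __wait(&ref, 0, cur, 1);

    /* each leaving waiter, where p points to the signaler's ref:
     * a single atomic access, no flag read after the signaler may
     * already have been released */
    if (a_fetch_add(p, -1) == 1) /* bias dropped and we were the last */
        __wake(p, 1, 1); /* may hit a stale address; that's the
                          * FUTEX_WAKE_OP-fixable case */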

> Ok, I'll try to switch to one-step-at-a-time-mode.

OK.

> > Or, if you don't have time to spend on side projects like test
> > cases, maybe someone else could test it?
> 
> Not at the moment. I would prefer to have a C thread implementation by
> the end of the week that we could agree upon.

Yes, that would be very nice!

Rich
