Date: Tue, 26 Aug 2014 13:32:18 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Re: Multi-threaded performance progress

On Tue, Aug 26, 2014 at 06:35:19PM +0200, Jens Gustedt wrote:
> Am Dienstag, den 26.08.2014, 09:04 +0200 schrieb Jens Gustedt:
> > Am Montag, den 25.08.2014, 23:43 -0400 schrieb Rich Felker:
> > > This release cycle looks like it's going to be huge for
> > > multi-threaded performance issues. So far the cumulative
> > > improvement on my main development system, as measured by the
> > > cond_bench.c by Timo Teräs, is from ~250k signals in 2 seconds
> > > up to ~3.7M signals in 2 seconds. That's comparable to what
> > > glibc gets on similar hardware with a cond var implementation
> > > that's much less correct. The improvements are a result of
> > > adding private futex support, redesigning the cond var
> > > implementation, and improvements to the spin-before-futex-wait
> > > behavior.
> >
> > Very impressive!
>
> I reviewed the new pthread_cond code closely and found it to be
> really rock solid.
>
> I have some minor things that might still improve things (or not).
> They make the code a bit longer, but they attempt to gain things
> here and there:
>
> - Tighten the lock on _c_lock such that the critical section
>   contains the least necessary.

Do you see any major opportunities for this? For the critical section
in pthread_cond_timedwait, a few operations could be moved out before
it, but they're all trivial assignments.

As for __private_cond_signal, it's currently necessary that all its
modifications to the list be made before either the cv lock or the
in-waiter-node barrier lock is released, because any waiters which win
the race and enter LEAVING status rather than SIGNALED status use the
cv lock to proceed. Perhaps this could be changed, though...

> - Have all the update of the list of waiters done by the signaling
>   or broadcasting thread. This work is serialized by the lock on
>   the cv, anyhow, so let the main work be done by a thread that
>   already holds the lock and is scheduled.

The problem I ran into was that the unwait operation can be from
cancellation or timeout, in which case the waiter has to remove
itself from the list, and needs to obtain the cv lock to do this. And
it's not clear to me how the waiter can know that the signaling
thread is taking responsibility for removing it from the list without
synchronizing with the signaling thread like it does now. In any case
the costly synchronization here only happens on hopefully-very-rare
races.

> - In case of broadcast, work on head and tail of the list first.
>   These are the only ones that would change the _c_head and _c_tail
>   entries of the cv.

But we can't release the lock anyway until all waiter states have
been atomic cas'd, or at least doing so doesn't seem safe to me.

> - Try to reduce the number of futex calls. Threads that are leaving
>   don't have to regain the lock when there is known contention with
>   a signaler, now that the signaler is doing the main work in that
>   case.

How do they detect this contention? If they won the race and changed
state to LEAVING, they don't see the contention. If they lose the
race, they become SIGNALED, and thus take the barrier path rather
than the cv-lock path.

>   Also only wake up the signaling thread at the end when he is
>   known to be inside a futex call.

I think this could be achieved trivially by having ref start at 1
rather than 0, and having the signaling thread a_dec ref just before
going into the maybe-wait loop. Then the waiters won't send the futex
wake unless the signaler reached the a_dec already, since they won't
see the hitting-zero step.

> There are perhaps other possibilities, like doing some spinning in
> "lock" before going into __wait.

The __wait function has a built-in spin.

Rich