Date: Fri, 5 Sep 2014 12:45:10 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Re: In-progress stuff, week of Sept 1

On Fri, Sep 05, 2014 at 11:27:52AM +0200, Jens Gustedt wrote:
> > On Friday, 05.09.2014 at 04:32 -0400, Rich Felker wrote:
> > On Fri, Sep 05, 2014 at 09:41:42AM +0200, Jens Gustedt wrote:
> > > Hello,
> > > 
> > > On Thursday, 04.09.2014 at 21:14 -0400, Rich Felker wrote:
> > > > First, things that definitely are going in this release cycle:
> > > > 
> > > > - Jens Gustedt's C11 threads: From discussion it's ready, but the last
> > > >   iteration of the patches added more gratuitous changes at the same
> > > >   time it removed others. Jens is going to be less available for a
> > > >   while, so I'm going to finish getting the patches ready to commit,
> > > >   but it's going to take a little work going back and reviewing the
> > > >   email threads to determine what was the outcome of discussion versus
> > > >   what's new, and among what's new, whether it's essential or side
> > > >   changes orthogonal to adding C11 threads.
> > > 
> > > Sorry if that gave you that impression. There is only one thing I
> > > remember that is not really sorted out in the last patch series I
> > > sent: the timespec_get things. The other patches should be relatively
> > > clean, no? In particular, the higher-numbered patches (I think it was
> > > 8 and 9) can just be omitted if you don't like them.
> > > 
> > > In any case, just send me requests for diffs or entire files in
> > > whichever version; my git history contains them all.
> > 
> > I think it's mainly the __clock_gettime changes that crept well
> > outside the scope of doing C11 threads, but I need to look again.
> > Trying to get some other things committed at the moment.
> 
> So tell me anything you need in terms of patches that might ease your
> task.

If it looks like a big task when I read over it again, I'll see if
there's something you could do that would make it easier. Otherwise
it's probably easier just to do it myself than to add email latency.
Thanks though.

> > > I have other points that I delayed for later. Off the top of my head:
> > > 
> > >  - review atomics
> > > 
> > >    * This starts by adding some "i" constraints to the assembler. Gcc
> > >    is perfectly capable of optimizing static inline asm that has "ir"
> > >    constraints. In the code a lot of these are used with the constants
> > >    0 or 1, so this would shave off the use of a register at all these
> > >    points.
> > 
> > Examples? We already use "ir" for syscalls on some archs, and I had
> > good results. But I wasn't aware that there were places in the atomic
> > code where "i" could help.
> 
> On i386 and friends these are all those that do "and", "or", or
> "store". All of these could get "ir", I think. `a_and` has 1 usage in
> src/, `a_or` has 4, and `a_store` has 26. For the latter, all but 4
> are with constants.

Only 2 uses of a_or have constant operands. But you're right that it
could be beneficial for a_store, and for those two uses of a_or. I'm
not opposed to this change.
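
For concreteness, here is roughly what an i386 a_store with the "ir"
constraint would look like (a sketch, not copied from the tree), so
that a constant operand like 0 or 1 gets encoded as an immediate
instead of tying up a register:

static inline void a_store(volatile int *p, int x)
{
        /* "ir" lets gcc pass x either as an immediate or in a
         * register; the "memory" clobber keeps surrounding loads and
         * stores from being moved across the asm by the compiler. */
        __asm__ __volatile__ ( "movl %1, %0"
                : "=m"(*p) : "ir"(x) : "memory" );
}

With a plain "r" constraint the same call site would have to load the
constant into a register first.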

> > >    * Review the necessity of having all of these with "memory". As long
> > >    as the pointer argument is constrained with "m" and not "p", this
> > >    should suffice to encode the data dependency. The only thing that
> > >    we'd have to be careful with is reordering, but I think this could
> > >    be achieved by using `volatile`, instead.
> > 
> > In order for their memory-barrier semantics to be useful, all atomics
> > also need to be full compiler barriers. Otherwise the compiler could
> > move loads/stores of data it thinks is unrelated across the asm block,
> > and thereby across the memory barrier.
> 
> Which would be ok if we'd allow for acquire-release semantics. So this
> is related to the discussion below.

It's almost never acceptable for the compiler to move loads/stores in
BOTH directions across the atomic, which removing the "memory"
clobber would allow, as far as I can tell. So I think
acquire-release semantics only give a significant compiler-level
benefit if you use the intrinsics. On the other hand, I suspect
allowing the compiler to reorder loads/stores is utterly useless for
performance inside the synchronization primitives (as opposed to
direct use of atomics by apps, where it might be very useful). The
useful part is reducing or removing the cache synchronization
requirements between cores.
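
For reference, the i386 a_cas is roughly the following (a sketch from
memory, not copied from the tree); the trailing "memory" clobber is
exactly what stops the compiler from moving loads/stores of
"unrelated" data across the asm block:

static inline int a_cas(volatile int *p, int t, int s)
{
        /* Without the "memory" clobber, gcc could keep unrelated
         * memory cached in registers across this asm and move
         * accesses past it in either direction. */
        __asm__ __volatile__ ( "lock ; cmpxchg %3, %1"
                : "=a"(t), "=m"(*p) : "a"(t), "r"(s) : "memory" );
        return t;
}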

> > >    * Move to gcc's __sync and __atomic builtins. The former have been
> > >    available for a long time, and both are easily testable with
> > >    macros. In the case of __atomic this would make it possible to move
> > >    the atomics used in musl to acquire-release semantics, which may
> > >    pay off for some archs, in particular arm, I think.
> > > 
> > >    This should also make it easier to port to new architectures in the
> > >    future, as soon as there is a working gcc for them.
> 
> > This has already been discussed in the past, and it's not a good
> > option:
> > 
> > - It would severely reduce the set of compilers that can be used to
> >   build musl.
> 
> Probably I am too naive on that part. We are using gcc extensions in
> many places, so I thought that all compilers that can be used with
> musl basically implement gcc's major features.

While not currently documented, the set of GCC extensions used is
intentionally extremely minimal, and limited to situations where there
is absolutely no other way to represent the needed semantics. The
current code presumably works all the way back to gcc 3.0 (not
tested), works with pcc, and would work with tcc if some things on
tcc's side not related to gcc compatibility (just C correctness) were
fixed.

If I'm not mistaken, the __sync intrinsics require at least gcc 4.2
and perhaps even a later post-GPLv3 version (unacceptable to a
considerable portion of musl's users). I'm not clear whether pcc
supports them too.
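
If we ever did go that route, the compile-time test Jens mentions
could look something like this sketch (the 4.7 cutoff for the __atomic
builtins comes from gcc's documentation; everything else here is
illustrative, not a proposed patch):

#if defined(__GNUC__) && \
    (__GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 7))
/* compiler claims to have the __atomic builtins (gcc 4.7+) */
static inline int a_cas(volatile int *p, int t, int s)
{
        __atomic_compare_exchange_n(p, &t, s, 0,
                __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
        return t;
}
#else
/* fall back to the arch-specific asm definitions */
#endif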

> As these extensions are easily testable, having this as an option
> would still be doable.

Options are combinatorics for testing. Unless they actually have a
huge practical benefit, they're a net liability, not a feature. IMO,
this (not any single option, but the overall abundance of them) is the
main reason uClibc is bitrotting today while musl is moving forward at
a very fast pace -- we don't have option combinatorics to maintain,
test, and constantly fight bitrot in.

I'm not entirely opposed to optional support for these (e.g. arm
already has to have _runtime_ options for these, which are even worse,
but it's not yet implemented), but for the time being it wouldn't even
be a benefit since we'd be using seq_cst everywhere anyway...

> > I'd like to pursue an approach with relaxed atomics at some point,
> > probably with asm still rather than intrinsics, but IMO it's rather
> > low priority. I'd like to see the whole C11/POSIX alignment matter get
> > settled first so we're not shooting for a moving target.
> 
> sure, I completely agree

OK.

> > >  - implement a simple lock feature based on futex that encodes the
> > >    waiters count inside the `int`.
> > > 
> > >    After the discussion with Rich some weeks ago I gave it some
> > >    thought, and I think this can simply be done by using the low
> > >    (parity) bit for the lock and the other bits for the waiters
> > >    count. On i386 the parity bit (like any other bit) can effectively
> > >    be accessed atomically with "lock ; btsl".
> > > 
> > >    IIRC, having such a simple lock would also help as a third step of
> > >    the changes to cv that I proposed and that you mention above.
> > 
> > I'm not sure what you mean,
> 
> again, off the top of my head, I remember two places with "ad hoc"
> locking conventions that we touched in the last few weeks: cv and
> pthread_once_t. Both encode waiters in a non-optimal way: cv only has
> an estimate of "somebody might be waiting", and pthread_once_t uses a
> static `waiters` variable for all instances.

Yes, this is bad, but pthread_once_t would be fine with a waiters flag
in the futex int. I'll look at improving that later. But since the
number of pthread_once_t objects in the whole program is
finite/bounded, it's really irrelevant to performance.

> > but it's a bad idea to have the waiters
> > count inside the futex int. This is because every time the futex int
> > changes, you risk FUTEX_WAIT returning with EAGAIN. Under high load of
> > waiters arriving/leaving, it might be impossible to _ever_ start a
> > wait; all threads may just spin. The futex int should be something
> > that doesn't change except for a very small finite number of changes
> > (like adding a waiter flag, not a waiter count) before you actually
> > have a chance of successfully getting the lock or whatever.
> 
> First, I think you overestimate the trouble that we can have with
> that.

Actually I don't. It follows from the massive performance problems I
had with the original cv implementation where wait was "write your own
tid then futex wait" and signal was "write your own tid then futex
wake". This is trivially correct and avoids sequence number wrapping
issues like the current pshared implementation, but it had a HUGE
number of spurious wakes: something like 3:1 (spurious:genuine) for
several threads hammering the cv. 

On typical cpus, there's a window of 500+ cycles between the atomic
operation in userspace and the recheck in kernelspace. This is plenty
of time for another core to cause your futex wait to fail with EAGAIN.
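
To make the "small finite number of changes" point concrete, here is
a rough sketch (not musl's actual internal lock; it assumes the usual
a_cas/a_swap/__syscall helpers and the FUTEX_* constants) of a futex
lock whose int only ever holds 0 (unlocked), 1 (locked), or 2 (locked,
may have waiters):

static void lock(volatile int *l)
{
        int c = a_cas(l, 0, 1);
        if (c != 0) {
                if (c != 2) c = a_swap(l, 2);
                while (c != 0) {
                        /* sleep only while the lock still reads 2 */
                        __syscall(SYS_futex, l, FUTEX_WAIT, 2, 0);
                        c = a_swap(l, 2);
                }
        }
}

static void unlock(volatile int *l)
{
        /* wake one waiter only if someone may actually be sleeping */
        if (a_swap(l, 0) == 2)
                __syscall(SYS_futex, l, FUTEX_WAKE, 1);
}

A newly arriving waiter changes the int at most from 1 to 2, so it
never invalidates the value that already-parked waiters passed to
FUTEX_WAIT; compare that with a waiters count, where every arrival
and departure changes the int.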

> In any case, for the two use cases I mention above this problem isn't
> very relevant, I think. For cv this is the internal lock; there will
> be at most one "waiting" thread on it (it has to hold the mutex) and
> perhaps several signaling threads, but it should be rare for it to be
> contended by more than two threads at a time. For the current version
> the "may-have-waiters" flag may trigger some superfluous wake-ups, I
> think.

This is possibly true, but it would need some analysis, and thinking
about it now is distracting me from getting other things done that we
want done now. ;-)

> And pthread_once_t is of minor performance importance, anyhow.

I don't see how a may-have-waiters approach has any cost for
pthread_once_t unless some of the waiters are cancelled. And that's a
pathological case that's not worth optimizing.
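
The kind of thing I have in mind is roughly the following sketch (not
the actual pthread_once code; cancellation is ignored, and it assumes
the same helpers as the lock sketch above). The control int only ever
holds 0 (not started), 1 (in progress), 2 (in progress, may have
waiters), or 3 (done):

static int once(volatile int *control, void (*init)(void))
{
        int c;
        for (;;) {
                c = a_cas(control, 0, 1);
                if (c == 0) {
                        init();
                        /* wake waiters only if some may be sleeping;
                         * INT_MAX is from limits.h */
                        if (a_swap(control, 3) == 2)
                                __syscall(SYS_futex, control,
                                        FUTEX_WAKE, INT_MAX);
                        return 0;
                }
                if (c == 3) return 0;
                /* mark that there may now be waiters, then sleep */
                if (c == 1) a_cas(control, 1, 2);
                __syscall(SYS_futex, control, FUTEX_WAIT, 2, 0);
        }
}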

> But it is a pity that there is no FUTEX_WAIT_OP or something like
> that which would allow testing for things other than equality.

Yes. And as I've noted before, I think it would allow for a very nice
cv implementation that's basically entirely kernel-side and that works
for process-shared cvs. (You could use it to make unlocking the mutex
truly atomic with respect to waiting on the cv futex.)

Rich
