Date: Fri, 05 Sep 2014 11:27:52 +0200
From: Jens Gustedt <jens.gustedt@...ia.fr>
To: musl@...ts.openwall.com
Subject: Re: In-progress stuff, week of Sept 1

On Friday, 05.09.2014, 04:32 -0400, Rich Felker wrote:
> On Fri, Sep 05, 2014 at 09:41:42AM +0200, Jens Gustedt wrote:
> > Hello,
> > 
> > On Thursday, 04.09.2014, 21:14 -0400, Rich Felker wrote:
> > > First, things that definitely are going in this release cycle:
> > > 
> > > - Jens Gustedt's C11 threads: From discussion it's ready, but the last
> > >   iteration of the patches added more gratuitous changes at the same
> > >   time it removed others. Jens is going to be less available for a
> > >   while, so I'm going to finish getting the patches ready to commit,
> > >   but it's going to take a little work going back and reviewing the
> > >   email threads to determine what was the outcome of discussion versus
> > >   what's new, and among what's new, whether it's essential or side
> > >   changes orthogonal to adding C11 threads.
> > 
> > Sorry if it left you with that impression. There is only one thing I
> > remember that is not really sorted out in the last patch series I
> > sent, the timespec_get things. The other patches should be relatively
> > clean, no? In particular, the higher-numbered patches (I think it was
> > 8 and 9) can just be omitted if you don't like them.
> > 
> > In any case, just send me requests for diffs or entire files in this
> > or that version; my git history contains them all.
> 
> I think it's mainly the __clock_gettime changes that crept well
> outside the scope of doing C11 threads, but I need to look again.
> Trying to get some other things committed at the moment.

So tell me anything you need in terms of patches that might ease your
task.

> > I have other points that I delayed for later. Off my head:
> > 
> >  - review atomics
> > 
> >    * This starts by adding some "i" constraints to the assembler. Gcc is
> >    perfectly capable of optimizing static inline asm that has "ir"
> >    constraints. In the code a lot of these are used with the constants 0
> >    or 1, so this would shave off the use of a register at all these
> >    points.
> 
> Examples? We already use "ir" for syscalls on some archs, and I had
> good results. But I wasn't aware that there were places in the atomic
> code where "i" could help.

On i386 and friends these are all the ones that do "and", "or", or
"store". All of these could get "ir", I think. `a_and` has 1 use in
src/, `a_or` has 4, and `a_store` has 26. For the latter, all but 4 are
called with constants.
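
Roughly what I have in mind (just a sketch, not a patch against the real
atomic.h; the exact barrier instruction is only illustrative):

static inline void a_store(volatile int *p, int x)
{
	/* "ir" instead of "r": when x is a compile-time constant such as
	 * 0 or 1, gcc can emit the immediate directly into the movl
	 * instead of first loading it into a scratch register. */
	__asm__ __volatile__(
		"movl %1, %0 ; lock ; orl $0, (%%esp)"
		: "=m"(*p)
		: "ir"(x)
		: "memory" );
}

With a plain "r" constraint, every call site that passes a literal wastes
a register and an extra mov just to feed the asm.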

> >    * Review the necessity of having all of these with "memory". As long
> >    as the pointer argument is constrained with "m" and not "p", this
> >    should suffice to encode the data dependency. The only thing that
> >    we'd have to be careful with is reordering, but I think this could
> >    be achieved by using `volatile`, instead.
> 
> In order for their memory-barrier semantics to be useful, all atomics
> also need to be full compiler barriers. Otherwise the compiler could
> move loads/stores of data it thinks is unrelated across the asm block,
> and thereby across the memory barrier.

That would be OK if we allowed for acquire-release semantics. So this
is related to the discussion below.
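
Just to spell out what we are talking about (illustrative sketch, not
musl's actual code): the "memory" clobber is what keeps gcc from moving
surrounding accesses across the asm, independently of what the "m"
operand pins down.

int data;
volatile int ready;

static inline void a_store(volatile int *p, int x)
{
	__asm__ __volatile__( "movl %1, %0"
		: "=m"(*p)
		: "ir"(x)
		: "memory" );	/* full compiler barrier */
}

void publish(int v)
{
	data = v;
	/* Without the "memory" clobber, gcc could sink the store to
	 * `data` below the asm, since the asm only depends on *p. With
	 * the clobber, all pending stores are emitted before the flag is
	 * set; on x86 the hardware side is already strongly ordered. */
	a_store(&ready, 1);
}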

> >    * Move to gcc's __sync and __atomic builtins. The former have been
> >    available forever, and both are easily testable with macros. In the
> >    case of __atomic this would make it possible to move the atomics used
> >    in musl to acquire-release semantics, which may pay off for some
> >    archs, in particular arm, I think.
> > 
> >    This also should make it easier to port to new architectures in the
> >    future, as soon as there is a working gcc for it.

> This has already been discussed in the past, and it's not a good
> option:
> 
> - It would severely reduce the set of compilers that can be used to
>   build musl.

Probably I am too naive about that part. We use gcc extensions in
many places, so I thought that all compilers that can be used with
musl basically implement gcc's major features.

Since these extensions are easily testable, having this as an option
would still be doable.
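
For illustration, the selection could look roughly like this (just a
sketch; the fallback asm is the i386 one and is only there to show the
structure):

static inline int a_fetch_add(volatile int *p, int v)
{
#if defined(__ATOMIC_SEQ_CST)
	/* gcc >= 4.7 / clang: __atomic builtins, memory order per call site */
	return __atomic_fetch_add(p, v, __ATOMIC_SEQ_CST);
#elif defined(__GNUC__)
	/* older gcc: legacy __sync builtins, always full barriers */
	return __sync_fetch_and_add(p, v);
#else
	/* hand-written asm, as today */
	__asm__ __volatile__( "lock ; xaddl %0, %1"
		: "=r"(v), "=m"(*p) : "0"(v) : "memory" );
	return v;
#endif
}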

> - It makes it impossible to correctly support archs where different
>   atomic strategies might need to be chosen at runtime. This includes
>   both arm (see the long threads on what to do about broken grsec
>   kernels without the kuser helper page) and sh, and maybe more. Even
>   if the compiler could be taught to solve this problem, we'd then be
>   bumping the compiler requirement to gcc 5.0 or something, which
>   would be completely unacceptable.

Sure, for such cases this must be easily tunable.

> - POSIX requires seq_cst semantics for everything anyway, so the
>   benefit would only be for C11 sync objects, and only if we write new
>   implementations of them using relaxed atomics.

Right.

> I'd like to pursue an approach with relaxed atomics at some point,
> probably with asm still rather than intrinsics, but IMO it's rather
> low priority. I'd like to see the whole C11/POSIX alignment matter get
> settled first so we're not shooting for a moving target.

Sure, I completely agree.

> >  - implement a simple lock feature based on futex that encodes the
> >    wait counter inside the `int`.
> > 
> >    After the discussion with Rich some weeks ago I gave it a thought,
> >    and I think this can be simply done by using the parity bit for the
> >    lock and the other bits for the waiters count. On i386 the parity
> >    bit (as any other bit) can effectively be accessed atomically with
> >    "lock ; btsl ".
> > 
> >    IIRC, having such a simple lock would also help as a third step of
> >    the changes to cv that I proposed and that you mention above.
> 
> I'm not sure what you mean,

Again, off the top of my head, I remember two places with "ad hoc"
locking conventions that we touched in the last weeks: cv and
pthread_once_t. Both encode waiters in a non-optimal way: cv only has
an estimate of "somebody might be waiting", and pthread_once_t uses a
single static `waiters` variable shared by all instances.

> but it's a bad idea to have the waiters
> count inside the futex int. This is because every time the futex int
> changes, you risk FUTEX_WAIT returning with EAGAIN. Under high load of
> waiters arriving/leaving, it might be impossible to _ever_ start a
> wait; all threads may just spin. The futex int should be something
> that doesn't change except for a very small finite number of changes
> (like adding a waiter flag, not a waiter count) before you actually
> have a chance of successfully getting the lock or whatever.

First, I think you overestimate the trouble that we can have with
that. A thread that increments the wait count and goes into the futex
call must be running. If the futex_wait is not possible, EAGAIN is
generally returned without descheduling the thread. So the number of
threads spinning could be reduced to at most the number of cores
(check for EAGAIN, reload the `int` without changing it, and retry).
This might drag on performance in some extreme cases, but having an
accurate wait count might benefit others.
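
To make this concrete, a rough sketch of what I mean (names and the
exact bit layout are made up for the example, and the __sync builtins
are just a stand-in for whatever atomics we end up using):

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

/* bit 0 = lock, bits 1..31 = waiter count (in steps of 2) */

static void simple_lock(volatile int *l)
{
	/* uncontended fast path: set the lock bit */
	if (!(__sync_fetch_and_or(l, 1) & 1)) return;

	__sync_fetch_and_add(l, 2);		/* register as a waiter */
	for (;;) {
		int cur = *l;
		if (!(cur & 1)) {
			if (!(__sync_fetch_and_or(l, 1) & 1)) break;
			continue;
		}
		/* FUTEX_WAIT fails with EAGAIN if *l != cur, i.e. if another
		 * waiter arrived or left meanwhile; reload and retry without
		 * ever spinning on the lock bit itself. */
		syscall(SYS_futex, l, FUTEX_WAIT, cur, 0, 0, 0);
	}
	__sync_fetch_and_sub(l, 2);		/* locked, no longer waiting */
}

static void simple_unlock(volatile int *l)
{
	/* clear the lock bit; the old value tells us whether anyone waits */
	if (__sync_fetch_and_and(l, ~1) & ~1)
		syscall(SYS_futex, l, FUTEX_WAKE, 1, 0, 0, 0);
}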

In any case, for the two use cases I mention above this problem isn't
very relevant, I think. For cv this is the internal lock: there will
be at most one "waiting" thread on it (it has to hold the mutex) and
perhaps several signaling threads, but it should be rare that the lock
is contended by more than two threads at a time. In the current
version the "may-have-waiters" flag may trigger some superfluous
wake-ups, I think.

And pthread_once_t is of minor performance importance, anyhow.

But it is a pity that there is no FUTEX_WAIT_OP or something like
that which would allow testing for things other than equality.

> >  - __timedwait translates absolute time to relative time, by taking
> >    the current time with clock_gettime. This could be avoided for
> >    CLOCK_REALTIME and CLOCK_MONOTONIC by using FUTEX_WAIT_BITSET
> >    instead of FUTEX_WAIT, which takes an absolute time from the
> >    start. Since C threads don't leave a choice for the clock they are
> >    using, this would avoid this code path completely and shave off
> >    one system call.
> 
> IIRC I tried to do that at one point but then decided against it. That
> was several years back, so it might be worth revisiting now, but keep
> in mind clock_gettime is expected to be a vdso call (no syscall) on
> first-class archs so it's not as big an issue as you might think.

Sure, all of this has low priority.
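
For the record, the absolute-timeout variant would look roughly like
this (the helper name is made up; FUTEX_WAIT_BITSET needs Linux >=
2.6.25 and measures the timeout against CLOCK_MONOTONIC by default,
or against CLOCK_REALTIME with FUTEX_CLOCK_REALTIME set):

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <time.h>
#include <errno.h>

static int abs_timedwait(volatile int *addr, int val,
                         clockid_t clk, const struct timespec *at)
{
	int op = FUTEX_WAIT_BITSET | FUTEX_PRIVATE_FLAG;
	if (clk == CLOCK_REALTIME) op |= FUTEX_CLOCK_REALTIME;
	/* the timeout is an absolute point in time, so no clock_gettime()
	 * and no absolute-to-relative conversion beforehand */
	if (syscall(SYS_futex, addr, op, val, at, 0, FUTEX_BITSET_MATCH_ANY) < 0)
		return errno;	/* ETIMEDOUT, EAGAIN, EINTR, ... */
	return 0;
}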

Jens

-- 
:: INRIA Nancy Grand Est ::: AlGorille ::: ICube/ICPS :::
:: ::::::::::::::: office Strasbourg : +33 368854536   ::
:: :::::::::::::::::::::: gsm France : +33 651400183   ::
:: ::::::::::::::: gsm international : +49 15737185122 ::
:: http://icube-icps.unistra.fr/index.php/Jens_Gustedt ::


