Date: Fri, 07 Oct 2022 13:53:14 +0300
From: Alexey Izbyshev <izbyshev@...ras.ru>
To: musl@...ts.openwall.com
Subject: Re: Re: MT fork and key_lock in pthread_key_create.c

On 2022-10-07 04:26, Rich Felker wrote:
> On Thu, Oct 06, 2022 at 03:50:54PM -0400, Rich Felker wrote:
>> On Thu, Oct 06, 2022 at 03:20:42PM -0400, Rich Felker wrote:
>> > On Thu, Oct 06, 2022 at 10:02:11AM +0300, Alexey Izbyshev wrote:
>> > > On 2022-10-06 09:37, Alexey Izbyshev wrote:
>> > > >Hi,
>> > > >
>> > > >I noticed that fork() doesn't take key_lock that is used to protect
>> > > >the global table of thread-specific keys. I couldn't find mentions of
>> > > >this lock in the MT fork discussion in the mailing list archive. Was
>> > > >this lock overlooked?
>> > > >
>> > > >Also, I looked at how __aio_atfork() handles a similar case with
>> > > >maplock, and it seems wrong. It takes the read lock and then simply
>> > > >unlocks it both in the parent and in the child. But if there were
>> > > >other holders of the read lock at the time of fork(), the lock won't
>> > > >end up in the unlocked state in the child. It should probably be
>> > > >completely nulled-out in the child instead.
>> > > >
>> > > Looking at aio further, I don't understand how it's supposed to work
>> > > with MT fork at all. __aio_atfork() is called in _Fork() when the
>> > > allocator locks are already held. Meanwhile another thread could be
>> > > stuck in __aio_get_queue() holding maplock in exclusive mode while
>> > > trying to allocate, resulting in deadlock.
>> >
>> > Indeed, this is messy and I don't think it makes sense to be doing
>> > this at all. The child is just going to throw away the state so the
>> > parent shouldn't need to synchronize at all, but if we walk the
>> > multi-level map[] table in the child after async fork, it's possible
>> > that the contents seen are inconsistent, even that the pointers are
>> > only half-written or something.
>> >
>> > I see a few possible solutions:
>> >
>> > 1. Just set map = 0 in the child and leak the memory. This is not
>> >    going to matter unless you're doing multiple generations of fork
>> >    with aio anyway.
>> >
>> > 2. The same, but be a little bit smarter. pthread_rwlock_tryrdlock in
>> >    the child, and if it succeeds, we know the map is consistent so we
>> >    can just zero it out the same as now. Still "leaks" but only on
>> >    contention to expand the map.
>> >
>> > 3. Getting a little smarter still: move the __aio_atfork for the
>> >    parent side from _Fork to fork, outside of the critical section
>> >    where malloc lock is held. Then proceed as in (2). Now, the
>> >    tryrdlock is guaranteed to succeed in the child. Leak is only
>> >    possible when _Fork is used (in which case the child context is an
>> >    async signal one, and thus calling any aio_* that would allocate
>> >    map[] again is UB -- note that in this case, the only reason we
>> >    have to do anything at all in the child is to prevent close from
>> >    interacting with aio).
>> >
>> > After writing them out, 3 seems like the right choice.
>> 
>> Proposed patch attached.
> 
>> diff --git a/src/aio/aio.c b/src/aio/aio.c
>> index fa24f6b6..4c3379e1 100644
>> --- a/src/aio/aio.c
>> +++ b/src/aio/aio.c
>> @@ -401,11 +401,25 @@ void __aio_atfork(int who)
>>  	if (who<0) {
>>  		pthread_rwlock_rdlock(&maplock);
>>  		return;
>> +	} else if (!who) {
>> +		pthread_rwlock_unlock(&maplock);
>> +		return;
>>  	}
>> -	if (who>0 && map) for (int a=0; a<(-1U/2+1)>>24; a++)
>> +	if (pthread_rwlock_tryrdlock(&maplock)) {
>> +	/* Obtaining lock may fail if _Fork was called not via
>> +		 * fork. In this case, no further aio is possible from
>> +		 * child and we can just null out map so __aio_close
>> +		 * does not attempt to do anything. */
>> +		map = 0;
>> +		return;
>> +	}
>> +	if (map) for (int a=0; a<(-1U/2+1)>>24; a++)
>>  		if (map[a]) for (int b=0; b<256; b++)
>>  			if (map[a][b]) for (int c=0; c<256; c++)
>>  				if (map[a][b][c]) for (int d=0; d<256; d++)
>>  					map[a][b][c][d] = 0;
>> -	pthread_rwlock_unlock(&maplock);
>> +	/* Re-initialize the rwlock rather than unlocking since there
>> +	 * may have been more than one reference on it in the parent.
>> +	 * We are not a lock holder anyway; the thread in the parent was. */
>> +	pthread_rwlock_init(&maplock, 0);
>>  }
>> diff --git a/src/process/_Fork.c b/src/process/_Fork.c
>> index da063868..fb0fdc2c 100644
>> --- a/src/process/_Fork.c
>> +++ b/src/process/_Fork.c
>> @@ -14,7 +14,6 @@ pid_t _Fork(void)
>>  	pid_t ret;
>>  	sigset_t set;
>>  	__block_all_sigs(&set);
>> -	__aio_atfork(-1);
>>  	LOCK(__abort_lock);
>>  #ifdef SYS_fork
>>  	ret = __syscall(SYS_fork);
>> @@ -32,7 +31,7 @@ pid_t _Fork(void)
>>  		if (libc.need_locks) libc.need_locks = -1;
>>  	}
>>  	UNLOCK(__abort_lock);
>> -	__aio_atfork(!ret);
>> +	if (!ret) __aio_atfork(1);
>>  	__restore_sigs(&set);
>>  	return __syscall_ret(ret);
>>  }
>> diff --git a/src/process/fork.c b/src/process/fork.c
>> index ff71845c..80e804b1 100644
>> --- a/src/process/fork.c
>> +++ b/src/process/fork.c
>> @@ -36,6 +36,7 @@ static volatile int *const *const atfork_locks[] = {
>>  static void dummy(int x) { }
>>  weak_alias(dummy, __fork_handler);
>>  weak_alias(dummy, __malloc_atfork);
>> +weak_alias(dummy, __aio_atfork);
>>  weak_alias(dummy, __ldso_atfork);
>> 
>>  static void dummy_0(void) { }
>> @@ -50,6 +51,7 @@ pid_t fork(void)
>>  	int need_locks = libc.need_locks > 0;
>>  	if (need_locks) {
>>  		__ldso_atfork(-1);
>> +		__aio_atfork(-1);
>>  		__inhibit_ptc();
>>  		for (int i=0; i<sizeof atfork_locks/sizeof *atfork_locks; i++)
>>  			if (*atfork_locks[i]) LOCK(*atfork_locks[i]);
>> @@ -75,6 +77,7 @@ pid_t fork(void)
>>  				if (ret) UNLOCK(*atfork_locks[i]);
>>  				else **atfork_locks[i] = 0;
>>  		__release_ptc();
>> +		if (ret) __aio_atfork(0);
>>  		__ldso_atfork(!ret);
>>  	}
>>  	__restore_sigs(&set);
> 
> There's at least one other related bug here: when __aio_get_queue has
> to take the write lock, it does so without blocking signals, so close
> called from a signal handler that interrupts it will self-deadlock.
> 
This is worse. Even if maplock didn't exist, __aio_get_queue still takes 
the queue lock, so close from a signal handler would still deadlock. 
Maybe this could be fixed by simply blocking signals earlier in submit? 
aio_cancel already calls __aio_get_queue with signals blocked.
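
Roughly what I have in mind, as an untested sketch (not a patch): move 
the sigfillset/pthread_sigmask pair that submit() already uses further 
down (if I'm reading it right) up to the top, before __aio_get_queue, 
so that neither lock is ever taken with signals unblocked:

	sigset_t allmask, origmask;

	/* Block all signals before __aio_get_queue so that neither the
	 * maplock write lock nor the queue lock is ever held while a
	 * signal handler calling close() can run in this thread. */
	sigfillset(&allmask);
	pthread_sigmask(SIG_BLOCK, &allmask, &origmask);

	struct aio_queue *q = __aio_get_queue(cb->aio_fildes, 1);
	if (!q) {
		pthread_sigmask(SIG_SETMASK, &origmask, 0);
		return -1; /* errno set as in the current code */
	}
	/* ... rest of submit() unchanged; the signal-blocking it
	 * currently does before creating the io thread becomes
	 * redundant since signals are already blocked here ... */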

> This is fixable by blocking signals for the write-lock critical
> section, but there's still a potentially long window where we're
> blocking forward progress of other threads.
> 
To make this more concrete, are you worrying about the libc-internal 
allocator delaying other threads attempting to take maplock (even in 
shared mode)? If so, this seems like a problem that is independent of 
the signal-safety of close().

> I think the right thing to do here is add another lock for updating
> the map that we can take while we still hold the rdlock. After taking
> this lock, we can re-inspect if any work is needed. If so, do all the
> work *still holding only the rdlock*, but not writing the results into
> the existing map. After all the allocation is done, release the
> rdlock, take the wrlock (with signals blocked), install the new
> memory, then release all the locks. This makes it so the wrlock is
> only held momentarily, under very controlled conditions, so we don't
> have to worry about messing up AS-safety of close.
> 
Sorry, I can't understand exactly what you mean here, in particular, 
what the old lock is supposed to protect if its holders are not 
mutually excluded with updates of the map guarded by the new lock.

I understand the general idea of doing allocations outside the critical 
section though. I imagine it in the following way:

1. Take maplock in shared mode.
2. Find the first level/entry in the map that we need to allocate. If no 
allocation is needed, goto 7.
3. Release maplock.
4. Allocate all needed map levels and a queue.
5. Take maplock in exclusive mode.
6. Modify the map as needed, remembering chunks that we need to free 
because another thread beat us.
7. Take queue lock.
8. Release maplock.
9. Free all unneeded chunks. // This could be delayed further if doing 
it under the queue lock is undesirable.

Waiting at step 7 under the exclusive lock can occur only if at step 6 
we discovered that we don't need to modify the map at all (i.e. we're going 
to use an already existing queue). It doesn't seem to be a problem 
because nothing seems to take the queue lock for prolonged periods 
(except possibly step 9), but even if it is, we could "downgrade" the 
exclusive lock to the shared one by dropping it and retrying from step 1 
in a loop.
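
To make the steps concrete, here is a deliberately simplified sketch of 
the lookup path. It uses a flat one-level table instead of the real 
four-level map (MAXFD, map1 and get_queue_sketch are made-up names for 
illustration only); the point is the lock ordering, not the data 
structure:

#include <pthread.h>
#include <stdlib.h>

#define MAXFD 1024 /* stand-in; the real map covers the whole fd space */

struct aio_queue { pthread_mutex_t lock; int ref; /* ... */ };

static struct aio_queue **map1;
static pthread_rwlock_t maplock = PTHREAD_RWLOCK_INITIALIZER;

static struct aio_queue *get_queue_sketch(int fd)
{
	struct aio_queue *q, *newq = 0;
	struct aio_queue **newmap = 0;
	int need_map;

	/* Steps 1-2: look up under the shared lock. */
	pthread_rwlock_rdlock(&maplock);
	need_map = !map1;
	q = need_map ? 0 : map1[fd];
	if (!q) {
		/* Steps 3-4: drop the lock and allocate everything we
		 * might need. */
		pthread_rwlock_unlock(&maplock);
		if (need_map) newmap = calloc(MAXFD, sizeof *newmap);
		newq = calloc(1, sizeof *newq);
		if ((need_map && !newmap) || !newq) goto fail;
		pthread_mutex_init(&newq->lock, 0);
		/* Steps 5-6: install under the exclusive lock unless
		 * another thread beat us to it; anything we did not
		 * install is freed in step 9. */
		pthread_rwlock_wrlock(&maplock);
		if (!map1) { map1 = newmap; newmap = 0; }
		if (!map1[fd]) { map1[fd] = newq; newq = 0; }
		q = map1[fd];
	}
	/* Steps 7-8: take the queue lock while still holding maplock
	 * (shared or exclusive), then release maplock. */
	pthread_mutex_lock(&q->lock);
	pthread_rwlock_unlock(&maplock);
	/* Step 9: free whatever we allocated but did not install. */
	free(newmap);
	free(newq);
	return q;
fail:
	free(newmap);
	free(newq);
	return 0;
}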

> As an alternative, the rwlock could be removed entirely in favor of
> making the whole map[] atomic. This has the amusing effect of making
> what's conceptually one of the worst C types ever to appear in real
> code:
> 
> static struct aio_queue *volatile *volatile *volatile *volatile *volatile map;
> 
> although it would actually have to be declared void *volatile map;
> with lots of type conversions to navigate through it, in order to
> respect type rules for a_cas_p. Then all that would be needed is one
> normal lock governing modifications.
> 
> Given that this code *tries* to mostly use high-level synchronization
> primitives, my leaning would be not to do that.
> 
I'm not sure how such an atomic map is supposed to work with queue 
reference counting. We must ensure that "remove the queue from the map 
if we own the last reference" step is atomic, and currently it relies on 
holding the maplock in exclusive mode.
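
For reference, the operation I mean is the last-reference removal, 
which (continuing the simplified one-level sketch from above, so map1 
is a made-up stand-in for the real map) is only correct because the 
refcount check and the map update happen together under the exclusive 
maplock:

	/* Drop a reference and remove the queue from the map if it was
	 * the last one. Holding maplock exclusively guarantees that no
	 * concurrent lookup can hand out a new reference to q between
	 * the refcount check and the removal. */
	pthread_rwlock_wrlock(&maplock);
	pthread_mutex_lock(&q->lock);
	if (!--q->ref) {
		map1[fd] = 0;	/* no lookup can find q any more */
		pthread_mutex_unlock(&q->lock);
		free(q);
	} else {
		pthread_mutex_unlock(&q->lock);
	}
	pthread_rwlock_unlock(&maplock);

With a lock-free map[], that window between the refcount check and 
clearing the map entry would have to be closed in some other way.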

Thanks,
Alexey
