Date: Tue, 28 Jul 2015 10:58:21 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Re: What's left for 1.1.11 release?

On Tue, Jul 28, 2015 at 04:50:33PM +0200, Jens Gustedt wrote:
> Am Dienstag, den 28.07.2015, 10:18 -0400 schrieb Rich Felker:
> > On Tue, Jul 28, 2015 at 04:09:38PM +0200, Jens Gustedt wrote:
> > > Hello,
> > > 
> > > Am Montag, den 27.07.2015, 23:40 -0400 schrieb Rich Felker:
> > > > In principle the a_store issue affects all libc-internal __lock/LOCK
> > > > uses,
> > > 
> > > so this worries me since I assumed that UNLOCK had release consistency
> > > for the __atomic implementation.
> > 
> > It does. The problem is that it lacks acquire consistency, which we
> > need in order to know whether to wake.
> 
> Ah, I think we are speaking of different things here. I want release
> consistency for the lock operation, in the sense of being guaranteed
> that all threads waiting for the lock will eventually learn that it
> has been released. So you are telling me that the current version
> doesn't guarantee this?

This is no problem; you get it for free on x86, and it's properly
achieved with explicit barriers on all other archs.

> The operation for which you need acquire consistency is in fact the
> load of l[1]. Somehow the current approach is ambiguous as to which
> is the atomic object. Is it l[0], is it l[1], or is it the pair of
> them?

l[0] is the lock word. l[1] is the waiter count; while it's modified
atomically, the read of it in the unlock path is a relaxed-order plain
load. Contrary to my expectations, real-world x86 chips will actually
reorder the read of l[1] before the store to l[0], resulting in a
failure-to-wake deadlock.
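
To make the failure mode concrete, here is a rough, self-contained
model of the pattern in question, using __atomic builtins and raw
futex calls rather than the actual libc internals (lock_model and
unlock_model are just illustrative names, not musl source):

/* Model only: l[0] is the lock word, l[1] the waiter count. */
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

static void futex_wait(volatile int *a, int val)
{
	syscall(SYS_futex, (int *)a, FUTEX_WAIT_PRIVATE, val, 0, 0, 0);
}

static void futex_wake(volatile int *a, int n)
{
	syscall(SYS_futex, (int *)a, FUTEX_WAKE_PRIVATE, n, 0, 0, 0);
}

static void lock_model(volatile int l[2])
{
	while (__atomic_exchange_n(&l[0], 1, __ATOMIC_ACQUIRE)) {
		__atomic_fetch_add(&l[1], 1, __ATOMIC_SEQ_CST); /* announce waiter */
		futex_wait(&l[0], 1);                           /* sleep while locked */
		__atomic_fetch_sub(&l[1], 1, __ATOMIC_SEQ_CST);
	}
}

static void unlock_model(volatile int l[2])
{
	/* release-only store, like the current a_store on x86 */
	__atomic_store_n(&l[0], 0, __ATOMIC_RELEASE);
	/* plain (relaxed) read of the waiter count */
	if (l[1])
		futex_wake(&l[0], 1);
	/* The hardware may perform the read of l[1] before the store to
	   l[0] becomes globally visible; it can then see a waiter count
	   of 0 from before a contending thread incremented it, skip the
	   wake, and leave that thread blocked forever. */
}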

> > > > and stdio locks too, but it's only been observed in malloc.
> > > > Since there don't seem to be any performance-relevant uses of a_store
> > > > that don't actually need the proper barrier, I think we have to just
> > > > put an explicit barrier (lock orl $0,(%esp) or mfence) after the store
> > > > and live with the loss of performance.
> > > 
> > > How about using an xchg instruction instead? This would perhaps
> > > "waste" a register, but that sort of optimization should not be
> > > critical in the vicinity of code that needs memory
> > > synchronization anyhow.
> > 
> > How is this better? My intent was to avoid incurring a read on the
> > cache line that's being written and instead achieve the
> > synchronization by poking at a cache line (the stack) that should not
> > be shared.
> 
> In fact, I think you need a read on the cache line here, don't you?
> You want to know the real value of l[1], no?

These specific callers do need l[1], but that's specific to the
callers, not fundamental to a_store. Also, in principle l[1] need not
even be in the same cache line (ideally it wouldn't be, but in
practice it probably is) since l[] only has 32-bit alignment.
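
For concreteness, the barrier-after-store approach I described above
would look roughly like this for i386; this is a sketch of the idea,
not necessarily the exact code that gets committed:

/* Plain store followed by a locked no-op on the stack, so the barrier
   itself does not pull in the cache line holding the lock word. */
static inline void a_store(volatile int *p, int x)
{
	__asm__ __volatile__(
		"movl %1, %0\n\t"
		"lock ; orl $0, (%%esp)"
		: "=m"(*p)
		: "r"(x)
		: "memory", "cc");
}

The xchg variant proposed above would fold the store and the barrier
into one instruction, at the cost of reading back the cache line being
written.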

> To be safe, I think this needs a full cmpxchg on the pair (l[0],
> l[1]), otherwise you can't know if the waiter count l[1] corresponds
> to the value just before the release of the lock.

No. As long as a_store behaves as a full barrier (including acquire
behavior) as it was intended to, you cannot read a value of l[1] older
than the value it had at the time of a_store, because there's a
globally consistent order between the a_inc and a_store.
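
Spelling that out against the model above (illustrative, not a formal
proof):

/* Waiter (in lock_model):            Unlocker (in unlock_model):
     a_inc(&l[1]);   // full barrier    a_store(&l[0], 0);  // full barrier
     futex_wait(&l[0], 1);              if (l[1]) futex_wake(&l[0], 1);

   If both a_inc and a_store are full barriers, they fall into a single
   global order. Either the a_inc comes first, so the unlocker's read
   of l[1] sees the incremented count and issues the wake, or the
   a_store comes first, so the waiter's futex value check sees l[0]==0
   and it does not block. Either way no waiter is stranded, and no
   cmpxchg over the pair (l[0], l[1]) is required. */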

Rich
