Date: Sun, 17 May 2015 13:59:18 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Re: Deduplicating atomics written in terms of CAS

On Sun, May 17, 2015 at 06:59:53PM +0200, Jens Gustedt wrote:
> Am Sonntag, den 17.05.2015, 12:28 -0400 schrieb Rich Felker:
> > On Sun, May 17, 2015 at 09:37:19AM +0200, Jens Gustedt wrote:
> > > Am Sonntag, den 17.05.2015, 02:14 -0400 schrieb Rich Felker:
> > > > - a_and_64/a_or_64 (malloc only; these are misnamed too)
> > > 
> > > I should have checked the use before my last mail. They are
> > > definitely misnamed.
> > > 
> > > Both uses of them look ok as far as atomicity is concerned; only
> > > one of the a_and or a_or calls triggers.
> > > 
> > > The only object (mal.binmap) to which this is applied is in fact
> > > volatile, so it must actually be reloaded every time it is used.
> > > 
> > > But in line 352 the code then relies on another assumption, namely
> > > that 64-bit loads are always atomic. I don't see why this should
> > > hold in general.
> > 
> > I don't think there's such an assumption. The only assumption is that
> > each bit is read exactly the number of times it would be on the
> > abstract machine, so that we can't observe inconsistent values for the
> > same object. Lack of any heavy synchronization around reading the mask
> > may result in failure to see some changes or seeing them out of order,
> > but it doesn't matter: If a bin is wrongly seen as non-empty, locking
> > and attempting to unbin from it will fail. If it is wrongly seen as
> > empty, the worst that can happen is that a larger chunk gets split
> > (a less-optimal choice now, though it would have been the optimal
> > one an instant earlier) instead of a smaller one being used to
> > satisfy the allocation.
> 
> So to summarize what you are saying: in this special context, an
> out-of-sync load of one of the sub-words of a 64-bit word would only
> impact performance, not correctness. Nice.

Right. Technically it may also impact fragmentation-optimality, but
only in a way that would also be affected by timing/scheduling
differences, so I don't think it makes sense to insist on an
optimality condition there.
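
To make that concrete, here's a rough sketch (not the actual malloc
code; binmap, lock_and_try_unbin and the scan loop are simplified
stand-ins for what malloc does) of how the reader tolerates a stale
view of the mask:

#include <stdint.h>

extern volatile uint64_t binmap;   /* stand-in for mal.binmap */
void *lock_and_try_unbin(int i);   /* hypothetical: NULL if bin i turns
                                      out to be empty under its lock */

static void *alloc_from_bins(int start)
{
    uint64_t mask = binmap >> start;  /* possibly-stale snapshot */
    for (int i = start; mask; i++, mask >>= 1) {
        void *c;
        if (!(mask & 1)) continue;
        c = lock_and_try_unbin(i);  /* stale non-empty bit: just fails */
        if (c) return c;
    }
    /* stale empty bits only cost optimality: split a larger chunk */
    return 0;
}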

> A maybe stupid question, then: why use atomics at all here? You could
> perhaps remove all that 64-bit pseudo-atomic stuff then.

We absolutely would not want two concurrent attempts to set a bit to
clobber each other's result, so that the other result would _never_ be
seen. This would result in free memory being lost -- the bin would
appear to be empty until something was added to it again. That's the
motivation for the atomics.
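
A minimal sketch of the distinction, assuming musl's a_cas primitive
(the wrapper name and the plain read-modify-write shown for contrast
are only illustrative):

/* Plain read-modify-write: two threads freeing into different bins can
 * each load the old word, OR in their own bit, and store back; whichever
 * store lands last silently erases the other thread's bit:
 *
 *     *word |= bit;        (lost update possible)
 */

int a_cas(volatile int *p, int t, int s);  /* returns the old value of *p */

/* CAS loop: retry until our bit is merged with whatever else changed,
 * so a concurrently set bit can never be wiped out. */
static void binmap_set_bit(volatile int *word, int bit)
{
    int old;
    do old = *word;
    while (!(old & bit) && a_cas(word, old, old | bit) != old);
}

(As far as I can tell the 64-bit wrappers just dispatch to the 32-bit
op on whichever half of the mask contains the bit, which fits the
"misnamed" complaint above: full 64-bit atomicity never comes into
play.)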

> > Of course it's an open question whether the complex atomics and
> > fine-grained locking in malloc help or hurt performance more on
> > average. I'd really like to measure this at some point. Overhauling
> > malloc to try to get significantly better multi-threaded performance
> > without the fragmentation-optimality sacrifices other mallocs make is
> > a long-term goal I have open.
> > 
> > > We already have a similar assumption for 32-bit int all over the
> > > place, and I am not too happy with such a "silent" assumption. For
> > > 64 bit, this assumption looks wrong to me.
> > 
> > I agree I wouldn't be happy with such an assumption, but I don't think
> > it's being made here.
> > 
> > > I would be much happier using explicit atomic types and atomic
> > > load functions or macros everywhere. For normal builds these could
> > > be dummy types that resolve to the actual code we have now. But
> > > this would allow hardening builds that check all atomic accesses
> > > for consistency.
> > 
> > There is no way to do an atomic 64-bit load on most of the archs we
> > support. So trying to make it explicit wouldn't help.
> 
> Ah sorry, I probably went too fast. My last paragraph was meant for
> all atomic operations, so in particular 32-bit ones. A macro "a_load"
> would make intentions clearer and would perhaps allow implementing an
> optional compile-time check that every object is used consistently as
> atomic or not.

The reason I'm mildly against this is that all current reads of
atomics, except via the return value of a_cas or a_fetch_add, are
relaxed-order. We don't care if we see a stale value; if staleness
could be a problem, the caller takes care of that in an efficient way.
Having a_load that's relaxed-order whereas all the existing atomics
are seq_cst order would be an inconsistent API design.
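
In C11 terms (only an analogy; musl's atomics are hand-written per arch
and don't use <stdatomic.h>), the mismatch would look roughly like:

#include <stdatomic.h>

_Atomic int x;

/* existing a_* ops behave like fully ordered (seq_cst) RMW operations */
int bump(void)
{
    return atomic_fetch_add_explicit(&x, 1, memory_order_seq_cst);
}

/* the proposed a_load would be a relaxed load, a weaker guarantee */
int peek(void)
{
    return atomic_load_explicit(&x, memory_order_relaxed);
}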

Adding a_load_relaxed or something would be an option, but I'm still
not really a fan. Compilers always aim to perform volatile ops as a
single load/store when possible (this matters for the MMIO-type uses
volatile is intended for: some hardware treats four single-byte writes
differently from one word-sized write, and of course the read direction
matters when reading MMIO back from hardware), so there's no reason to
expect the loads to break. I think this was more of an issue before we
moved to using volatile to model atomics.
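
Purely for illustration (musl defines no such function; the name and
shape here are assumed), a relaxed a_load would amount to little more
than the volatile read it would wrap:

static inline int a_load(volatile int *p)
{
    return *p;   /* a single word-sized load, no barriers */
}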

Rich
