musl - Re: atomic.h cleanup

Message-ID: <20160122000945.GW238@brightrain.aerifal.cx>
Date: Thu, 21 Jan 2016 19:09:45 -0500
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Re: atomic.h cleanup

On Sun, Jan 10, 2016 at 11:57:18AM -0500, Rich Felker wrote:
> On Sun, Jan 10, 2016 at 01:21:39PM +0100, Markus Wichmann wrote:
> > Hi all,
> > 
> > The development roadmap on the musl wiki lists the ominous point
> > "atomic.h cleanup" for 1.2.0.
> > 
> > I assume you mean a sort of simplification and unification. I noticed
> > that the RISC archs use rather liberal amounts of inline assembly for
> > the atomic operations. And I have always been taught that as soon as
> > you start copying code, you are probably doing it wrong.
> > 
> > So first thing I'd do: add a new file, let's call it atomic_debruijn.h.
> > It contains an implementation of a_ctz() and a_ctz_64() based on a
> > de Bruijn sequence. That way, all the architectures currently
> > implementing a_ctz() in this manner can just include that file, and a
> > lot of duplicate code goes out the window.
> > 
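For readers unfamiliar with the trick, here is a minimal sketch of a de Bruijn based count-trailing-zeros. The names ctz32_debruijn and ctz64_debruijn are hypothetical, and the constant and table are the widely published 32-bit pair, not necessarily the ones musl uses. The result is meaningless for an input of zero.

```c
#include <stdint.h>

/* Hypothetical names; constant/table are the commonly published 32-bit
 * de Bruijn pair, which may differ from what musl actually ships. */
static inline int ctz32_debruijn(uint32_t x)
{
    static const unsigned char table[32] = {
         0,  1, 28,  2, 29, 14, 24,  3, 30, 22, 20, 15, 25, 17,  4,  8,
        31, 27, 13, 23, 21, 19, 16,  7, 26, 12, 18,  6, 11,  5, 10,  9
    };
    /* x & -x isolates the lowest set bit; multiplying by the de Bruijn
     * constant moves a unique 5-bit pattern into the top bits. */
    return table[((x & -x) * 0x077CB531u) >> 27];
}

/* 64-bit variant built from two 32-bit lookups, as a 32-bit arch might do. */
static inline int ctz64_debruijn(uint64_t x)
{
    uint32_t lo = (uint32_t)x;
    if (lo)
        return ctz32_debruijn(lo);
    return 32 + ctz32_debruijn((uint32_t)(x >> 32));
}
```

An arch header that has no native ctz instruction could then define a_ctz() and a_ctz_64() as thin wrappers around a shared file like this.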
> > Second thing: We can reduce the inline assembly footprint and the amount
> > of duplicate code by adding a new file, let's call it atomic_llsc.h,
> > that implements a_cas(), a_cas_p(), a_swap(), a_fetch_add(), a_inc(),
> > a_dec(), a_and() and a_or() in terms of new functions that would have to
> > be defined, namely:
> > 
> > static inline void a_presync(void) - execute any barrier needed before
> > attempting an atomic operation, like "dmb ish" for arm, or "sync" for
> > ppc.
> > 
> > static inline void a_postsync(void) - execute any barrier needed
> > afterwards, like "isync" for PPC, or, again, "dmb ish" for ARM.
> > 
> > static inline int a_ll(int*) - perform an LL on the given pointer and
> > return the value there. This would be "lwarx" for PPC, or "ldrex" for
> > ARM.
> > 
> > static inline int a_sc(int*, int) - perform an SC on the given pointer
> > with the given value. Return zero iff that failed.
> > 
> > static inline void* a_ll_p(void*) - same as a_ll(), but with machine
> > words instead of int, if that's a difference.
> > 
> > static inline int a_sc_p(void*, void*) - same as a_sc(), but with
> > machine words.
> > 
> > 
> > With these functions we can implement e.g. CAS as:
> > 
> > static inline int a_cas(volatile int *p, int t, int s)
> > {
> >     int v;
> >     do {
> >         v = a_ll(p);
> >         if (v != t)
> >             break;
> >     } while (!a_sc(p, s));
> >     return v;
> > }
> > 
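To show how the barrier hooks and the ll/sc pair might fit together across several operations, here is a rough sketch that can be compiled on any host with GCC or Clang. The a_ll() and a_sc() bodies below are stand-ins written with __atomic builtins and a thread-local (an assumption made purely so the example runs anywhere); a real port would use the arch's own instructions, and unlike real ll/sc this emulation is subject to ABA.

```c
/* Stand-in primitives (assumption: emulated with GCC/Clang builtins so
 * the sketch runs on any host; real archs would use inline asm). */
static __thread int ll_value;   /* value seen by the last a_ll() */

static inline int a_ll(volatile int *p)
{
    ll_value = __atomic_load_n(p, __ATOMIC_RELAXED);
    return ll_value;
}

/* Returns nonzero on success, zero if *p changed since a_ll(). */
static inline int a_sc(volatile int *p, int v)
{
    int expected = ll_value;
    return __atomic_compare_exchange_n(p, &expected, v, 0,
                                       __ATOMIC_RELAXED, __ATOMIC_RELAXED);
}

static inline void a_presync(void)  { __atomic_thread_fence(__ATOMIC_SEQ_CST); }
static inline void a_postsync(void) { __atomic_thread_fence(__ATOMIC_SEQ_CST); }

/* Generic operations written once in terms of the primitives above. */
static inline int a_cas(volatile int *p, int t, int s)
{
    int old;
    a_presync();
    do old = a_ll(p);
    while (old == t && !a_sc(p, s));   /* stop early on mismatch */
    a_postsync();
    return old;
}

static inline int a_swap(volatile int *p, int v)
{
    int old;
    a_presync();
    do old = a_ll(p);
    while (!a_sc(p, v));
    a_postsync();
    return old;
}

static inline int a_fetch_add(volatile int *p, int v)
{
    int old;
    a_presync();
    do old = a_ll(p);
    while (!a_sc(p, (int)((unsigned)old + (unsigned)v)));  /* wrap, no UB */
    a_postsync();
    return old;
}
```

Only the four small primitives would need per-arch code; everything below them is shared.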
> > Add some #ifdefs to only activate the pointer variations if they're
> > needed (i.e. if we're on 64 bits) and Bob's your uncle.
> > 
> > The only hardship would be in implementing a_sc(), but that can be
> > solved by using a feature often referenced but rarely seen in the wild:
> > ASM goto. How that works is that, if the arch's SC instruction returns
> > success or failure in a flag and the CPU can jump on that flag (unlike,
> > say, microblaze, which can only jump on comparisons), then you encode
> > the jump in the assembly snippet but let the compiler handle the targets
> > for you. Since in all cases, we want to jump on failure, that's what the
> > assembly should do, so for instance for PowerPC:
> > 
> > static inline int a_sc(volatile int* p, int x)
> > {
> >     __asm__ goto ("stwcx. %0, 0, %1\n\tbne- %l2" : : "r"(x), "r"(p) : "cc", "memory" : fail);
> >     return 1;
> > fail:
> >     return 0;
> > }
> > 
> > I already checked the compiler output for such a design, but I never
> > tried running it for lack of hardware.
> > 
> > Anyway, this code makes it possible for the compiler to redirect the
> > conditional jump on failure to the top of the loop in a_cas(). Since the
> > return value isn't used otherwise, the values 1 and 0 never appear in
> > the generated assembly.
> > 
> > What do you say to this design?
> 
> Have you read this thread? :)
> 
> http://www.openwall.com/lists/musl/2015/05/20/1
> 
> I thought at one point it was linked from the wiki but maybe it got
> lost.
> 
> Basically I have this done already outside of musl as an experiment,
> but there are minor details that were holding it up. One annoyance is
> that, on some archs, success/failure of "sc" comes via a condition
> flag which the C caller can't easily branch on, so there's an extra
> conversion to a boolean result inside the asm and extra conversion
> back to a test/branch outside the asm. In practice we probably don't
> care.
> 
> One other issue is that risc-v seems to provide, at least on some
> implementations, stronger forward-progress guarantees than a normal
> ll/sc as long as the ll/sc are in order, within a few instruction
> slots of each other, with no branches between. Such conditions cannot
> be met without putting them in the same asm block, so we might need to
> do a custom version for risc-v if we want to take advantage of the
> stronger properties.
> 
> Anyway, at this point the main obstacle to finishing the task is doing
> the actual merging and testing, not any new coding, I think.

Most of this is committed now!

The original commit introducing the new system is:

http://git.musl-libc.org/cgit/musl/commit/?id=1315596b510189b5159e742110b504177bdd4932

Subsequent commits are converting archs over one by one to make best
use of the new system.

I'd really like to see some before-and-after benchmarks on real
hardware. I've been using Timo's cond_bench test as a quick check that
the new atomics have no obvious breakage. Under qemu user-level
emulation, its performance has roughly doubled for most of the llsc
archs because gcc can now inline more: it refuses to inline the
largeish ll/sc loops written in asm but is happy to inline the tiny
a_ll and a_sc asm.

Please report any regressions, etc.

Rich