Date: Thu, 21 Jan 2016 19:09:45 -0500
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Re: atomic.h cleanup

On Sun, Jan 10, 2016 at 11:57:18AM -0500, Rich Felker wrote:
> On Sun, Jan 10, 2016 at 01:21:39PM +0100, Markus Wichmann wrote:
> > Hi all,
> > 
> > The development roadmap on the musl wiki lists the ominous point
> > "atomic.h cleanup" for 1.2.0.
> > 
> > I assume you mean a sort of simplification and unification. I noticed
> > that for the RISC archs there are rather liberal amounts of inline
> > assembly for the atomic operations. And I have always been taught that
> > as soon as you start copying code, you are probably doing it wrong.
> > 
> > So first thing I'd do: add a new file, let's call it atomic_debruijn.h.
> > It contains an implementation of a_ctz() and a_ctz_64() based on the
> > DeBruijn number. That way, all the architectures currently implementing
> > a_ctz() in this manner can just include that file, and a lot of
> > duplicate code goes out the window.
> > 
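
For reference, the de Bruijn approach proposed above can be sketched in
portable C. This is only an illustration, not the code in musl: the
names ctz32_debruijn and ctz64_debruijn are hypothetical, the constant
and table are the widely published 32-bit de Bruijn pair, and the
input is assumed to be nonzero.

```c
#include <stdint.h>

/* Hypothetical name. Count trailing zeros of a nonzero 32-bit value.
 * (x & -x) isolates the lowest set bit; multiplying by the de Bruijn
 * constant moves a unique 5-bit pattern into the top bits, which
 * indexes the lookup table. */
static inline int ctz32_debruijn(uint32_t x)
{
    static const char table[32] = {
        0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
        31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
    };
    return table[((x & -x) * 0x077CB531u) >> 27];
}

/* Hypothetical name. 64-bit variant built from two 32-bit halves, so
 * no 64-bit multiply or second table is needed. Input must be nonzero. */
static inline int ctz64_debruijn(uint64_t x)
{
    uint32_t lo = (uint32_t)x;
    if (lo)
        return ctz32_debruijn(lo);
    return 32 + ctz32_debruijn((uint32_t)(x >> 32));
}
```

An arch header with no native count-trailing-zeros instruction could
then include one shared file of this kind instead of carrying its own
copy.
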
> > Second thing: We can reduce the inline assembly footprint and the amount
> > of duplicate code by adding a new file, let's call it atomic_llsc.h,
> > that implements a_cas(), a_cas_p(), a_swap(), a_fetch_add(), a_inc(),
> > a_dec(), a_and() and a_or() in terms of new functions that would have to
> > be defined, namely:
> > 
> > static inline void a_presync(void) - execute any barrier needed before
> > attempting an atomic operation, like "dmb ish" for arm, or "sync" for
> > ppc.
> > 
> > static inline void a_postsync(void) - execute any barrier needed
> > afterwards, like "isync" for PPC, or, again, "dmb ish" for ARM.
> > 
> > static inline int a_ll(int*) - perform an LL on the given pointer and
> > return the value there. This would be "lwarx" for PPC, or "ldrex" for
> > ARM.
> > 
> > static inline int a_sc(int*, int) - perform an SC on the given pointer
> > with the given value. Return zero iff that failed.
> > 
> > static inline void* a_ll_p(void*) - same as a_ll(), but with machine
> > words instead of int, if that's a difference.
> > 
> > static inline int a_sc_p(void*, void*) - same as a_sc(), but with
> > machine words.
> > 
> > 
> > With these functions we can implement e.g. CAS as:
> > 
> > static inline int a_cas(volatile int *p, int t, int s)
> > {
> >     int v;
> >     do {
> >         v = a_ll(p);
> >         if (v != t)
> >             break;
> >     } while (!a_sc(p, s));
> >     return v;
> > }
> > 
> > Add some #ifdefs to only activate the pointer variations if they're
> > needed (i.e. if we're on 64 bits) and Bob's your uncle.
> > 
> > The only hardship would be in implementing a_sc(), but that can be
> > solved by using a feature often referenced but rarely seen in the wild:
> > ASM goto. How that works is that, if the arch's SC instruction returns
> > success or failure in a flag and the CPU can jump on that flag (unlike,
> > say, microblaze, which can only jump on comparisons), then you encode
> > the jump in the assembly snippet but let the compiler handle the targets
> > for you. Since in all cases, we want to jump on failure, that's what the
> > assembly should do, so for instance for PowerPC:
> > 
> > static inline int a_sc(volatile int* p, int x)
> > {
> >     __asm__ goto ("stwcx. %0, 0, %1\n\tbne- %l2" : : "r"(x), "r"(p) : "cc", "memory" : fail);
> >     return 1;
> > fail:
> >     return 0;
> > }
> > 
> > I already tried the compiler results for such a design, but I never
> > tried running it for lack of hardware.
> > 
> > Anyway, this code makes it possible for the compiler to redirect the
> > conditional jump on failure to the top of the loop in a_cas(). Since the
> > return value isn't used otherwise, the values 1 and 0 never appear in
> > the generated assembly.
> > 
> > What do you say to this design?
> 
> Have you read this thread? :)
> 
> http://www.openwall.com/lists/musl/2015/05/20/1
> 
> I thought at one point it was linked from the wiki but maybe it got
> lost.
> 
> Basically I have this done already outside of musl as an experiment,
> but there are minor details that were holding it up. One annoyance is
> that, on some archs, success/failure of "sc" comes via a condition
> flag which the C caller can't easily branch on, so there's an extra
> conversion to a boolean result inside the asm and extra conversion
> back to a test/branch outside the asm. In practice we probably don't
> care.
> 
> One other issue is that risc-v seems to provide, at least on some
> implementations, stronger forward-progress guarantees than a normal
> ll/sc as long as the ll and sc are in order, within a few instruction
> slots of each other, with no branches between. Such conditions cannot
> be met without putting them in the same asm block, so we might need to
> do a custom version for risc-v if we want to take advantage of the
> stronger properties.
> 
> Anyway, at this point the main obstacle to finishing the task is doing
> the actual merging and testing, not any new coding, I think.

Most of this is committed now!

The original commit introducing the new system is:

http://git.musl-libc.org/cgit/musl/commit/?id=1315596b510189b5159e742110b504177bdd4932

Subsequent commits are converting archs over one by one to make best
use of the new system.

I'd really like to see some before-and-after benchmarks on real
hardware. I've been using Timo's cond_bench test as a quick check that
the new atomics have no obvious breakage. Under qemu user-level
emulation, its performance has roughly doubled for most of the llsc
archs because gcc is better able to inline: it refuses to inline the
largeish ll/sc loops written in asm but is happy to inline the tiny
a_ll and a_sc asm.
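
To illustrate the shape of the design (tiny per-arch ll/sc primitives,
with the loops written once in C), here is a minimal sketch. It is not
the committed code. All sketch_* names are hypothetical, and
sketch_ll/sketch_sc are host stand-ins built on the GCC/Clang __atomic
builtins so that the example runs anywhere; a real port would
implement just those two with instructions such as lwarx/stwcx. or
ldrex/strex. The stand-in compares values, so unlike real ll/sc it
cannot notice an intervening write that restores the old value, and
the separate barrier hooks are omitted because the builtin already
orders the store.

```c
/* Value observed by the most recent sketch_ll on this thread.
 * Assumption: only needed by the emulation, not by real ll/sc. */
static _Thread_local int sketch_linked;

/* Stand-in for load-linked: read *p and remember what was seen. */
static inline int sketch_ll(volatile int *p)
{
    return sketch_linked = __atomic_load_n(p, __ATOMIC_RELAXED);
}

/* Stand-in for store-conditional: store v only if *p still holds the
 * value seen by sketch_ll. Returns nonzero on success, 0 on failure. */
static inline int sketch_sc(volatile int *p, int v)
{
    int expected = sketch_linked;
    return __atomic_compare_exchange_n(p, &expected, v, 0,
                                       __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
}

/* Generic compare-and-swap: returns the old value; the swap happened
 * if and only if the return value equals t. */
static inline int sketch_cas(volatile int *p, int t, int s)
{
    int old;
    do old = sketch_ll(p);
    while (old == t && !sketch_sc(p, s));
    return old;
}

/* Generic fetch-and-add: returns the value before the addition.
 * The unsigned cast avoids signed-overflow undefined behavior. */
static inline int sketch_fetch_add(volatile int *p, int v)
{
    int old;
    do old = sketch_ll(p);
    while (!sketch_sc(p, (int)((unsigned)old + (unsigned)v)));
    return old;
}
```

Because only the two primitives contain arch-specific code, each
generic loop is plain C that the compiler is free to inline at every
call site.
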

Please report any regressions, etc.

Rich
