Date: Wed, 20 May 2015 01:11:08 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Refactoring atomics as llsc?

In the inline sh4a atomics thread, I discussed an idea of refactoring
atomics for llsc-style archs so that the arch would just need to
provide inline asm ll() and sc() functions (with more namespace-clean
names of course), and a shared atomic.h could build the a_* functions
on top of that, as in:

static inline int a_cas(volatile int *p, int t, int s)
{
	int old;
	do old = ll(p);
	while (old == t && !sc(p, s));
	return old;
}

(Note: I've omitted barriers for simplicity; they could be in the ll
and sc functions but it would probably make more sense to have them
outside the loop.)

In the sh4a thread, I was somewhat discouraged about this approach
because there's no way to model the output of the sc asm coming out in
a condition flag; the inline asm would have to convert the flag to a
value in a register. However, looking at the archs we have now, only
or1k, powerpc, and sh put the sc output in a condition flag. The rest
of the llsc-type archs leave the result as a boolean value in a
general-purpose register. So only a few archs would be negatively
affected, and only in a very minor way.

On the other hand, the benefits of doing this would be pretty
significant:

First of all, basically all archs could share all the implementation
logic for atomics, massively reducing the amount of per-arch asm and
the risk of subtle errors. Right now the only way to keep per-arch asm
down is to implement just a_cas and have everything else be a wrapper
for a_cas, but that leads to all the other atomics being a pair of
nested do/while loops, which is larger and slower than having a
"native" version of each op.
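
As a rough illustration of the difference, here is a minimal sketch of
a fetch-and-add written both ways. The ll()/sc() bodies are portable
stand-ins (a thread-local "reservation" plus a GCC/Clang __sync
builtin), assumed only so the example compiles anywhere; a real llsc
arch would supply inline asm instead. The names fetch_add_via_cas and
fetch_add_llsc are hypothetical, and barriers are omitted as above.

```c
/* Stand-ins for the per-arch primitives (assumption, not real arch
 * code): ll() remembers the value it loaded, and sc() stores only if
 * the location still holds that value. */
static __thread int ll_seen;

static inline int ll(volatile int *p)
{
	return ll_seen = *p;
}

static inline int sc(volatile int *p, int v)
{
	return __sync_bool_compare_and_swap(p, ll_seen, v);
}

/* a_cas built on ll/sc, as in the message. */
static inline int a_cas(volatile int *p, int t, int s)
{
	int old;
	do old = ll(p);
	while (old == t && !sc(p, s));
	return old;
}

/* Wrapper style: an outer retry loop around a_cas, which itself
 * contains the inner ll/sc loop. */
static inline int fetch_add_via_cas(volatile int *p, int v)
{
	int old;
	do old = *p;
	while (a_cas(p, old, old + v) != old);
	return old;
}

/* "Native" style: one ll/sc loop, no nesting. */
static inline int fetch_add_llsc(volatile int *p, int v)
{
	int old;
	do old = ll(p);
	while (!sc(p, old + v));
	return old;
}
```

Both functions return the value seen before the addition; only the
loop structure differs.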

Second, this approach would allow us to add some really nice new
atomic primitives without having to write them per-arch: things like
atomic-dec-if-positive. Normally these would be written with a CAS
retry loop, which is a pair of nested do/while loops on llsc archs,
but with the above approach it would be a single loop.
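
A minimal sketch of what such a primitive might look like on top of
ll/sc follows. The name dec_if_positive and its return convention
(the value observed before any decrement) are assumptions made for
illustration, and ll()/sc() are again emulated portably with a
thread-local variable and a __sync builtin rather than real per-arch
asm.

```c
/* Portable stand-ins for the arch primitives (assumption). */
static __thread int ll_seen;

static inline int ll(volatile int *p)
{
	return ll_seen = *p;
}

static inline int sc(volatile int *p, int v)
{
	return __sync_bool_compare_and_swap(p, ll_seen, v);
}

/* Hypothetical primitive: decrement *p only if it is positive.
 * Returns the old value, so the decrement happened iff result > 0.
 * Note the single loop: the condition is tested between ll and sc. */
static inline int dec_if_positive(volatile int *p)
{
	int old;
	do old = ll(p);
	while (old > 0 && !sc(p, old - 1));
	return old;
}
```

A caller could use it, for example, as a semaphore trywait:
"if (dec_if_positive(&count) > 0)" means a slot was acquired.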

Of course the big outlier is x86, which is not llsc based but has
actual atomic primitives at the instruction level. If we defined the
sc() primitive to take 3 args instead of 2 (address, old value from
ll, new value to conditionally store; most archs would ignore the old
value argument) then we could model x86 with ll being a plain load and
sc being cmpxchg to allow any new custom primitives to work using
cmpxchg. Then we would just continue providing custom versions of all
the old a_* ops (a_cas, a_fetch_add, a_inc, a_dec, a_and, a_or,
a_swap) to take advantage of the x86 instructions. These versions
could probably be shared by all x86 variants (i386, x86_64, x32) since
they're operating on 32-bit values and the asm should be the same.
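
The three-argument form could be sketched roughly as below. This is
an illustration under assumptions, not a proposed patch: the asm
follows the usual "lock ; cmpxchg" pattern, a __sync builtin is used
as a fallback so the example also builds on non-x86 hosts, and
fetch_max is a hypothetical "new custom primitive" chosen only to
show the generic loop working through cmpxchg.

```c
/* ll is a plain load on x86. */
static inline int ll(volatile int *p)
{
	return *p;
}

/* sc takes the value previously returned by ll and succeeds only if
 * *p still holds it. Returns nonzero on success. */
static inline int sc(volatile int *p, int old, int v)
{
#if defined(__i386__) || defined(__x86_64__)
	int seen = old;
	__asm__ __volatile__ (
		"lock ; cmpxchg %3, %1"
		: "=a"(seen), "=m"(*p) : "a"(seen), "r"(v) : "memory" );
	return seen == old;
#else
	/* Fallback for other hosts (assumption, for portability only). */
	return __sync_bool_compare_and_swap(p, old, v);
#endif
}

/* Hypothetical custom primitive written once against ll/sc: raise *p
 * to v if v is larger. Returns the value seen before the update. */
static inline int fetch_max(volatile int *p, int v)
{
	int old;
	do old = ll(p);
	while (old < v && !sc(p, old, v));
	return old;
}
```

An llsc arch would simply ignore the middle argument of sc(), so the
same fetch_max source would serve both kinds of arch.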

If we decide to go this way, it would replace the atomic.h refactoring
work I already sent to the list ("Deduplicating atomics written in
terms of CAS").

For the few archs that would be adversely affected (albeit only in a
very minor way) by this approach due to the inability to model
condition flags in asm outputs, we could in principle still keep some
or all of the old asm (probably cutting it down to the primitives that
actually matter for performance) if desired.

We could also keep the concept of atomic_generic.h using __sync
builtins, but offer ll() and sc() using a plain load for ll and
__sync_bool_compare_and_swap for sc, and then define a few other a_*
functions that the __sync primitives allow us to define directly.
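
A minimal sketch of such a generic fallback, assuming the
three-argument sc() described earlier and the GCC/Clang legacy __sync
builtins. The function signatures are assumptions modeled on the a_*
names listed above; a_swap is left out because the __sync family has
no full-barrier exchange builtin.

```c
/* ll/sc expressed with a plain load and a __sync builtin. */
static inline int ll(volatile int *p)
{
	return *p;
}

static inline int sc(volatile int *p, int old, int v)
{
	return __sync_bool_compare_and_swap(p, old, v);
}

/* Ops the __sync primitives provide directly, with no retry loop
 * visible at the C level. */
static inline int a_cas(volatile int *p, int t, int s)
{
	return __sync_val_compare_and_swap(p, t, s);
}

static inline int a_fetch_add(volatile int *p, int v)
{
	return __sync_fetch_and_add(p, v);
}

static inline void a_and(volatile int *p, int v)
{
	__sync_fetch_and_and(p, v);
}

static inline void a_or(volatile int *p, int v)
{
	__sync_fetch_and_or(p, v);
}
```

Anything not covered by a builtin would then fall back to a shared
ll/sc loop, with sc() doing the compare-and-swap.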

Comments?

Rich
