Date: Sun, 16 Aug 2015 12:07:40 -0400
From: Rich Felker <>
Subject: Re: [PATCH] replace a mfence instruction by an xchg

On Sun, Aug 16, 2015 at 02:42:33PM +0200, Jens Gustedt wrote:
> > > Exactly why a "mov" followed by a read-modify-write operation on some
> > > random address (here the stack pointer) should be faster than a
> > > read-modify-write operation on exactly the address you want to deal
> > > with seems odd.
> > 
> > xchg on the atomic address in principle reads a cache line (the write
> > destination) that's known to be shared with other threads despite not
> > needing it, then modifies it. mov on the other hand just appends the
> > write to the local store buffer; reading the target cache line is not
> > necessary. The subsequent barrier then takes care of ordering issues
> > (just the one case x86 doesn't guarantee already).
> > 
> > At least that's what happens in theory. It seemed to be slightly
> > faster in practice for me too, which is why I went with it for now
> > (theoretically faster + weak empirical evidence that it's faster =
> > reasonable basis for tentative conclusion, IMO) but I'm open to
> > further study of the different approaches and changing if we find
> > something else is better.
> Yes, that theory probably isn't enough to describe things. If you look
> at the latency of a read-modify-write instruction you have something
> like
> L = I + 2 M
> where I is the instruction latency itself, and M is the memory latency. M
> varies with the platform, chipset, whatever, but it should be the same
> for all lock instructions on the same machine. Just to give an order
> of magnitude, M should be around 100 cycles.
> "I" in turn can be very different for different lock instructions, and
> in addition varies for different processor variants. If I read the
> tables correctly it can be from about 10 to 100 cycles, so something
> really noticeable compared to M.

I'm confused where you're getting this from. The instruction latency
for most basic ALU and mov type operations on x86 is in the
single-digit range (and I mean binary digits, in many cases ;-).
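
For concreteness, the three store variants under discussion look
roughly like this (a sketch of the idea for i386, not necessarily the
exact musl asm; x86_64 would use %%rsp in the last one):

    /* original: plain store followed by a full fence */
    static inline void a_store_mfence(volatile int *p, int x)
    {
        __asm__ __volatile__( "mov %1, %0 ; mfence"
            : "=m"(*p) : "r"(x) : "memory" );
    }

    /* xchg variant: the store itself is the barrier, but it also
     * has to read the target cache line back */
    static inline void a_store_xchg(volatile int *p, int x)
    {
        __asm__ __volatile__( "xchg %1, %0"
            : "=m"(*p), "+r"(x) : : "memory" );
    }

    /* mov plus a locked rmw on a stack location as a cheaper barrier */
    static inline void a_store_orl(volatile int *p, int x)
    {
        __asm__ __volatile__( "mov %1, %0 ; lock ; orl $0,(%%esp)"
            : "=m"(*p) : "r"(x) : "memory" );
    }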

> What you'd like to have for a store instruction is
> L_{store} = I_{store} + M
> avoiding the second M for the return from memory. (The second M can be
> hidden until we hit the next load instruction.)
> Your approach with orl has
> I_{mov} + I_{or} + M;
> the original one has
> I_{mov} + I_{mfence} + 2 M,
> which gives you an extra M;
> the xchg approach has
> I_{xchg} + 2 M.
> From the graphs we can deduce that in fact the orl strategy is more
> stable than the two others, but still the 20% gap for the "normal"
> situation worries me.

Where are you measuring "20% gap for the 'normal' situation"? I'm
missing some context here.

> > Ultimately I think I'd like to get rid of a_store at some point anyway
> > since I think we can do better without it.
> At least in this context, yes. From what I see at the moment the best
> strategy for the initial part of a locking primitive is to only use
> cmpxchg instructions. This instruction sets "cc", so if it is integrated
> properly by the compiler, it can be used to conditionally jump,
> without even inspecting the value that had previously been in memory.
> Currently within musl such a close integration is difficult, because
> you can't easily jump from within __asm__. When using my stdatomic
> implementation (only building upon the builtin compiler support for
> atomics) the generated code is much better.

While I'd like to be able to use cc/flags directly, I don't think this
is a worthwhile optimization, at least on x86. The difference in
performance is totally dominated by the cost of the atomic op, and the
old-value output of cmpxchg is a register anyway so there's no extra
memory-read operation. Actually I experimented with GCC asm-goto
labels for sh4a ll/sc atomics, because getting the result out of the
cc bit into a register and then testing it is mildly expensive there, and
gcc generated amazingly good code, but the problems are:

(1) it depends on a nasty gcc feature that other compilers don't
implement well or at all, and

(2) it imposes a branch instruction inside the inline asm even if the
caller wants to throw away the result.
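
The general shape of the asm-goto trick, written here for x86 rather
than sh4a (sketch only; a_trylock is just a hypothetical name for
illustration):

    /* branch directly on the flags set by the atomic op, without
     * materializing the result in a register first; asm goto allows
     * no outputs, so the memory operand is an input plus a "memory"
     * clobber */
    static inline int a_trylock(volatile int *l)
    {
        __asm__ goto ( "lock ; btsl $0, %0 ; jc %l[busy]"
            : : "m"(*l) : "memory", "cc" : busy );
        return 0;
    busy:
        return 1; /* was already held */
    }

and the jc there is exactly the branch from point (2): it gets
emitted whether or not the caller cares about the result.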

In any case, this does not seem to be a worthwhile area to optimize.
The real costs of atomics are the memory latency and cache
synchronization overhead, not any time spent in instructions working
purely with registers.
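
For what it's worth, the usual x86 cmpxchg wrapper already hands the
old value back in a register (eax), so the follow-up test is a pure
register compare. Roughly (again a sketch, not necessarily the exact
musl code):

    static inline int a_cas(volatile int *p, int t, int s)
    {
        __asm__ __volatile__( "lock ; cmpxchg %2, %1"
            : "+a"(t), "+m"(*p) : "r"(s) : "memory", "cc" );
        return t; /* old value, straight out of eax */
    }

    /* typical caller: register compare plus branch, no extra memory read
     *   if (a_cas(&lk, 0, 1)) goto slow_path;
     */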

