Date: Sun, 16 Aug 2015 12:07:40 -0400
From: Rich Felker <>
Subject: Re: [PATCH] replace a mfence instruction by an xchg

On Sun, Aug 16, 2015 at 02:42:33PM +0200, Jens Gustedt wrote:
> > > Exactly why a "mov" followed by a read-modify-write operation on some
> > > random address (here the stack pointer) should be faster than a
> > > read-modify-write operation on exactly the address you want to deal
> > > with seems odd.
> > 
> > xchg on the atomic address in principle reads a cache line (the write
> > destination) that's known to be shared with other threads despite not
> > needing it, then modifies it. mov on the other hand just appends the
> > write to the local store buffer; reading the target cache line is not
> > necessary. The subsequent barrier then takes care of ordering issues
> > (just the one case x86 doesn't guarantee already).
> > 
> > At least that's what happens in theory. It seemed to be slightly
> > faster in practice for me too, which is why I went with it for now
> > (theoretically faster + weak empirical evidence that it's faster =
> > reasonable basis for tentative conclusion, IMO) but I'm open to
> > further study of the different approaches and changing if we find
> > something else is better.
> Yes, that theory probably isn't enough to describe things. If you look
> at the latency of a read-modify-write instruction you have something
> like
> L = I + 2 M
> where I is the instruction latency itself, and M is the memory latency. M
> varies with the platform, chipset, whatever, but it should be the same
> for all lock instructions on the same machine. Just to give an order
> of magnitude, M should be around 100 cycles.
> "I" in turn can be very different for different lock instructions, and
> in addition varies for different processor variants. If I read the
> tables correctly it can be from about 10 to 100 cycles, so something
> really noticeable compared to M.

I'm confused where you're getting this from. The instruction latency
for most basic ALU and mov type operations on x86 is in the
single-digit range (and I mean binary digits, in many cases ;-).
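
For concreteness, the three store variants under discussion look
roughly like this (a sketch of the idea for i386, not necessarily the
exact musl asm; x86_64 would use %%rsp in the last one):

    /* original: plain store followed by a full fence */
    static inline void a_store_mfence(volatile int *p, int x)
    {
        __asm__ __volatile__( "mov %1, %0 ; mfence"
            : "=m"(*p) : "r"(x) : "memory" );
    }

    /* xchg variant: the store itself is the barrier, but it also
     * has to read the target cache line back */
    static inline void a_store_xchg(volatile int *p, int x)
    {
        __asm__ __volatile__( "xchg %1, %0"
            : "=m"(*p), "+r"(x) : : "memory" );
    }

    /* mov plus a locked rmw on a stack location as a cheaper barrier */
    static inline void a_store_orl(volatile int *p, int x)
    {
        __asm__ __volatile__( "mov %1, %0 ; lock ; orl $0,(%%esp)"
            : "=m"(*p) : "r"(x) : "memory" );
    }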

> What you'd like to have for a store instruction is
> L_{store} = I_{store} + M
> avoiding the second M for the return from memory. (The second M can be
> hidden until we hit the next load instruction.)
> Your approach with orl has
> I_{mov} + I_{or} + M;
> the original one has
> I_{mov} + I_{mfence} + 2 M,
> which gives you an extra M;
> the xchg approach has
> I_{xchg} + 2 M.
> From the graphs we can deduce that in fact the orl strategy is more
> stable than the two others, but still the 20% gap for the "normal"
> situation worries me.

Where are you measuring "20% gap for the 'normal' situation"? I'm
missing some context here.

> > Ultimately I think I'd like to get rid of a_store at some point anyway
> > since I think we can do better without it.
> At least in this context, yes. From what I see at the moment the best
> strategy for the initial part of a locking primitive is to only use
> cmpxchg instructions. This instruction sets "cc", so if it is integrated
> properly by the compiler, it can be used to conditionally jump,
> without even inspecting the value that had previously been in memory.
> Currently within musl such a close integration is difficult, because
> you can't easily jump from within __asm__. When using my stdatomic
> implementation (only building upon the builtin compiler support for
> atomics) the generated code is much better.

While I'd like to be able to use cc/flags directly, I don't think this
is a worthwhile optimization, at least on x86. The difference in
performance is totally dominated by the cost of the atomic op, and the
old-value output of cmpxchg is a register anyway so there's no extra
memory-read operation. Actually I experimented with GCC asm-goto
labels for sh4a ll/sc atomics, because getting the result out of the
cc bit into a register and then testing it is mildly expensive there, and
gcc generated amazingly good code, but the problems are:

(1) it depends on a nasty gcc feature that other compilers don't
implement well or at all, and

(2) it imposes a branch instruction inside the inline asm even if the
caller wants to throw away the result.
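
The general shape of the asm-goto trick, written here for x86 rather
than sh4a (sketch only; a_trylock is just a hypothetical name for
illustration):

    /* branch directly on the flags set by the atomic op, without
     * materializing the result in a register first; asm goto allows
     * no outputs, so the memory operand is an input plus a "memory"
     * clobber */
    static inline int a_trylock(volatile int *l)
    {
        __asm__ goto ( "lock ; btsl $0, %0 ; jc %l[busy]"
            : : "m"(*l) : "memory", "cc" : busy );
        return 0;
    busy:
        return 1; /* was already held */
    }

and the jc there is exactly the branch from point (2): it gets
emitted whether or not the caller cares about the result.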

In any case, this does not seem to be a worthwhile area to optimize.
The real costs of atomics are the memory latency and cache
synchronization overhead, not any time spent in instructions working
purely with registers.
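
For what it's worth, the usual x86 cmpxchg wrapper already hands the
old value back in a register (eax), so the follow-up test is a pure
register compare. Roughly (again a sketch, not necessarily the exact
musl code):

    static inline int a_cas(volatile int *p, int t, int s)
    {
        __asm__ __volatile__( "lock ; cmpxchg %2, %1"
            : "+a"(t), "+m"(*p) : "r"(s) : "memory", "cc" );
        return t; /* old value, straight out of eax */
    }

    /* typical caller: register compare plus branch, no extra memory read
     *   if (a_cas(&lk, 0, 1)) goto slow_path;
     */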

