Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date: Sun, 16 Aug 2015 18:34:54 +0200
From: Jens Gustedt <jens.gustedt@...ia.fr>
To: musl@...ts.openwall.com
Subject: Re: [PATCH] replace a mfence instruction by an xchg
 instruction

Am Sonntag, den 16.08.2015, 12:07 -0400 schrieb Rich Felker:
> On Sun, Aug 16, 2015 at 02:42:33PM +0200, Jens Gustedt wrote:
> > > > Exactly why a "mov" followed by a read-modify-write operation to some
> > > > random address (here the stack pointer) should be faster than a
> > > > read-modify-write operation with exactly the address you want to deal
> > > > with looks weird.
> > > 
> > > xchg on the atomic address in principle reads a cache line (the write
> > > destination) that's known to be shared with other threads despite not
> > > needing it, then modifies is. mov on the other hand just appends the
> > > write to the local store buffer; reading the target cache line is not
> > > necessary. The subsequent barrier then takes care of ordering issues
> > > (just the one case x86 doesn't guarantee already).
> > > 
> > > At least that's what happens in theory. It seemed to be slightly
> > > faster in practice for me too, which is why I went with it for now
> > > (theoretically faster + weak empirical evidence that it's faster =
> > > reasonable basis for tentative conclusion, IMO) but I'm open to
> > > further study of the different approaches and changing if we find
> > > something else is better.
> > 
> > Yes, that theory probably isn't enough to describe things. If you look
> > at the latency of a read-modify-write instructions you have something
> > like
> > 
> > L = I + 2 M
> > 
> > where I is instruction latency itself, and M is the memory latency. M
> > varies with the platform, chipset whatever, but it should be the same
> > for all lock instructions on the same machine. Just to give an order
> > of magnitude, M should be around 100.
> > 
> > "I" in turn can be very different for different lock instructions, and
> > in addition varies for different processor variants. If I read the
> > tables correctly it can be from about 10 to 100 cycles, so something
> > really noticeable compared to M.
> 
> I'm confused where you're getting this from. The instruction latency
> for most basic ALU and mov type operations on x86 is in the
> single-digit range (and I mean binary digits, in many cases ;-).

I was talking about the latencies of locked instructions, they are
significantly higher. So if you want to discuss overall latency of an
instruction that guarantees memory order, you'd have to take these
latencies into account.

> > What you'd like to have for a store instruction is
> > 
> > L_{store} = I_{store} + M
> > 
> > avoiding the second M for the return from memory. (The second M can be
> > hidden until we hit the next load instruction)
> > 
> > Your approach with orl has
> > 
> > I_{mov} + I_{or} + M
> > 
> > the original one has
> > 
> > I_{mov} + I_{mfence} + 2 M
> > 
> > which gives you an extra M
> > 
> > the xchg approach has
> > 
> > I_{xchg} + 2 M
> > 
> > From the graphs we can deduce that in fact the orl strategy is more
> > stable that the two others, but still the 20% gap for the "normal"
> > situation worries me.
> 
> Where are you measuring "20% gap for the 'normal' situation"? I'm
> missing some context here.

In the graphics that I had attached. For 1 thread, the performance
improvement against the "mfence" version with the "orl" trick is 20%,
but the "xchg" gets even 20% more.

I didn't measure this, but before the introduction of the "mfence" to
"a_store" the performance should have been at least as for the "xchg"
version.

> > > Ultimately I think I'd like to get rid of a_store at some point anyway
> > > since I think we can do better without it.
> > 
> > At least in this context, yes. From what I see at the moment the best
> > strategy for the initial part of a locking primitive is to only use
> > cmpxchg instructions. This instruction sets "cc", so if is integrated
> > properly by the compiler, it can be used to conditionally jump,
> > without even inspecting the value that had previously been in memory.
> > 
> > Currently within musl such a close integration is difficult, because
> > you can't easily jump from within __asm__. When using my stdatomic
> > implementation (only building upon the builtin compiler support for
> > atomics) the generated code is much better.
> 
> While I'd like to be able to use cc/flags directly, I don't think this
> is a worthwhile optimization, at least on x86. The difference in
> performance is totally dominated by the cost of the atomic op, and the
> old-value output of cmpxchg is a register anyway so there's no extra
> memory-read operation. Actually I experimented with GCC asm-goto
> labels for sh4a ll/sc atomics, because getting the result out of the
> cc bit into a register than testing it is mildly expensive there, and
> gcc generated amazingly good code, but the problems are:
> 
> (1) it depends on a nasty gcc feature that other compilers don't
> implement well or at all, and
> 
> (2) it imposes a branch instruction inside the inline asm even if the
> caller wants to throw away the result.

Yes, I definitively agree. That is what you can avoid by using the
compiler's own constructs, the __sync, __atomic or __c11_atomic
buitlins. Here they know about the features of the cmpexchg
instruction and can do appropriate jumps.

> In any case, this does not seem to be a worthwhile area to optimize.
> The real costs of atomics are the memory latency and cache
> synchronization overhead, not any time spent in instructions working
> purely with registers.

I don't agree. First this was not completely what I meant. I think
being able to use the CC to jump can also abort some fetches of memory
that are still in the pipelin, so this is perhaps more than some
"register" instructions.

Then, at the beginning of all lock constructs we have tight spin
loops. As far as I can tell, in case of contention every instruction
in them counts.

Jens


-- 
:: INRIA Nancy Grand Est ::: Camus ::::::: ICube/ICPS :::
:: ::::::::::::::: office Strasbourg : +33 368854536   ::
:: :::::::::::::::::::::: gsm France : +33 651400183   ::
:: ::::::::::::::: gsm international : +49 15737185122 ::
:: http://icube-icps.unistra.fr/index.php/Jens_Gustedt ::





Download attachment "signature.asc" of type "application/pgp-signature" (182 bytes)

Powered by blists - more mailing lists

Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.