Date: Sun, 16 Aug 2015 14:42:33 +0200
From: Jens Gustedt <jens.gustedt@...ia.fr>
To: musl@...ts.openwall.com
Subject: Re: [PATCH] replace a mfence instruction by an xchg instruction

Hello,

Am Samstag, den 15.08.2015, 19:28 -0400 schrieb Rich Felker:
> On Sat, Aug 15, 2015 at 11:01:40PM +0200, Jens Gustedt wrote:
> > Am Samstag, den 15.08.2015, 16:17 -0400 schrieb Rich Felker:
> > > On Sat, Aug 15, 2015 at 08:51:41AM +0200, Jens Gustedt wrote:
> > > > according to the wisdom of the Internet, e.g.
> > > > 
> > > > https://peeterjoot.wordpress.com/2009/12/04/intel-memory-ordering-fence-instructions-and-atomic-operations/
> > > > 
> > > > an mfence instruction is about 3 times slower than an xchg instruction.
> > > 
> > > Uhg, then why does this instruction even exist if it does less and
> > > does it slower?
> > 
> > Because they do different things? :)
> > 
> > mfence synchronizes all memory; xchg, at least at first glance, only
> > synchronizes one word.
> 
> No, any lock-prefixed instruction, or xchg which has a builtin lock,
> fully orders all memory accesses. Essentially it contains a builtin
> mfence.

Hm, I think mfence does a bit more than that. The three fence
instructions were introduced when the asynchronous ("non-temporal")
move instructions came along with SSE.

I don't think that "lock" instructions synchronize with these
asynchronous moves, so the two (lock instructions and fences) are just
different animals. Perhaps this also answers your question up-thread,
why something like mfence exists at all.

> This is why I find it odd that a lone mfence is 3x slower.

> > But I also read that the relative performance of these instructions
> > depends a lot on the actual die you are dealing with.
> 
> Did you measure mfence being slower here?

Yes, I attach two graphs that show this (and other things). This is my
benchmark for testing generic atomics, here using __lock/__unlock as
the primitive. If you look at the performance difference for one
thread alone, it is dramatic: just by changing the a_store
instruction, application throughput is 40% better.

The difference comes not only from the change in __lock/__unlock
performance itself, but also because the rest of musl changes
performance with it; in particular, I think, because the
__lock/__unlock clones in malloc are affected by the change,
too. (That benchmark also does a lot of malloc/free.)

What I didn't notice before running these tests systematically last
night is that the performance of that benchmark also changes in the
congestion case, but there it is the other way around.

> > > > Here we not only had mfence but also the mov instruction that was to be
> > > > protected by the fence. Replace all that by a native atomic instruction
> > > > that gives all the ordering guarantees that we need.
> > > > 
> > > > This a_store function is performance critical for the __lock
> > > > primitive. In my benchmarks to test my stdatomic implementation I have a
> > > > substantial performance increase (more than 10%), just because malloc
> > > > does better with it.
> > > 
> > > Is there a reason you're not using the same approach as on i386? It
> > > was faster than xchg for me, and in principle it "should be faster".
> > 
> > I discovered your approach for i386 after I experimented with "xchg"
> > for x86_64. I guess the "lock orl" instruction is a replacement for
> > "mfence" because that one is not implemented for all variants of i386?
> > 
> > Exactly why a "mov" followed by a read-modify-write operation on some
> > random address (here the stack pointer) should be faster than a
> > read-modify-write operation on exactly the address you want to deal
> > with looks weird to me.
> 
> xchg on the atomic address in principle reads a cache line (the write
> destination) that's known to be shared with other threads despite not
> needing it, then modifies it. mov on the other hand just appends the
> write to the local store buffer; reading the target cache line is not
> necessary. The subsequent barrier then takes care of ordering issues
> (just the one case x86 doesn't guarantee already).
> 
> At least that's what happens in theory. It seemed to be slightly
> faster in practice for me too, which is why I went with it for now
> (theoretically faster + weak empirical evidence that it's faster =
> reasonable basis for tentative conclusion, IMO) but I'm open to
> further study of the different approaches and changing if we find
> something else is better.

Yes, that theory probably isn't enough to describe things. If you look
at the latency of a read-modify-write instruction, you have something
like

L = I + 2 M

where I is the instruction latency itself and M is the memory
latency. M varies with the platform, chipset and so on, but it should
be the same for all lock instructions on the same machine. To give an
order of magnitude, M should be around 100 cycles.

"I" in turn can be very different for different lock instructions, and
in addition varies for different processor variants. If I read the
tables correctly it can be from about 10 to 100 cycles, so something
really noticeable compared to M.

What you'd like to have for a store instruction is

L_{store} = I_{store} + M

avoiding the second M for the round trip back from memory. (The second
M can be hidden until we hit the next load instruction.)

Your approach with orl has

I_{mov} + I_{or} + M

the original one has

I_{mov} + I_{mfence} + 2 M

which gives you an extra M

the xchg approach has

I_{xchg} + 2 M

From the graphs we can deduce that the orl strategy is in fact more
stable than the other two, but the 20% gap in the "normal" situation
still worries me.

> Ultimately I think I'd like to get rid of a_store at some point anyway
> since I think we can do better without it.

At least in this context, yes. From what I see at the moment, the best
strategy for the initial part of a locking primitive is to use only
cmpxchg instructions. That instruction sets "cc", so if it is
integrated properly by the compiler, it can be used for a conditional
jump without even inspecting the value that was previously in memory.

Currently within musl such a close integration is difficult, because
you can't easily jump from within __asm__. When using my stdatomic
implementation (only building upon the builtin compiler support for
atomics) the generated code is much better.

Jens

-- 
:: INRIA Nancy Grand Est ::: Camus ::::::: ICube/ICPS :::
:: ::::::::::::::: office Strasbourg : +33 368854536   ::
:: :::::::::::::::::::::: gsm France : +33 651400183   ::
:: ::::::::::::::: gsm international : +49 15737185122 ::
:: http://icube-icps.unistra.fr/index.php/Jens_Gustedt ::




Download attachment "test-musl-all.eps" of type "image/x-eps" (21344 bytes)

Download attachment "test-musl-relative.eps" of type "image/x-eps" (21498 bytes)

Download attachment "signature.asc" of type "application/pgp-signature" (182 bytes)
