Date: Sat, 15 Aug 2015 19:28:15 -0400
From: Rich Felker <>
Subject: Re: [PATCH] replace a mfence instruction by an xchg

On Sat, Aug 15, 2015 at 11:01:40PM +0200, Jens Gustedt wrote:
> Am Samstag, den 15.08.2015, 16:17 -0400 schrieb Rich Felker:
> > On Sat, Aug 15, 2015 at 08:51:41AM +0200, Jens Gustedt wrote:
> > > according to the wisdom of the Internet, e.g
> > > 
> > >
> > > 
> > > a mfence instruction is about 3 times slower than an xchg instruction.
> > 
> > Ugh, then why does this instruction even exist if it does less and
> > does it slower?
> Because they do different things?
> mfence is there to synchronize all memory; xchg, at least at first
> glance, affects only one word.

No, any lock-prefixed instruction, or xchg which has a builtin lock,
fully orders all memory accesses. Essentially it contains a builtin
mfence. This is why I find it odd that a lone mfence is 3x slower.

> But I also read that the relative performance of these instructions
> depends a lot on the actual die you are dealing with.

Did you measure mfence being slower here?

> > > Here we not only had mfence but also the mov instruction that was to be
> > > protected by the fence. Replace all that by a native atomic instruction
> > > that gives all the ordering guarantees that we need.
> > > 
> > > This a_store function is performance critical for the __lock
> > > primitive. In my benchmarks to test my stdatomic implementation I have a
> > > substantial performance increase (more than 10%), just because malloc
> > > does better with it.
> > 
> > Is there a reason you're not using the same approach as on i386? It
> > was faster than xchg for me, and in principle it "should be faster".
> I discovered your approach for i386 after I experimented with "xchg"
> for x86_64. I guess the "lock orl" instruction is a replacement for
> "mfence" because the latter is not implemented for all variants of
> i386? Exactly why a "mov" followed by a read-modify-write operation
> on some random address (here the stack pointer) should be faster
> than a read-modify-write operation on exactly the address you want
> to deal with seems weird to me.

xchg on the atomic address in principle reads a cache line (the write
destination) that's known to be shared with other threads despite not
needing it, then modifies it. mov, on the other hand, just appends the
write to the local store buffer; reading the target cache line is not
necessary. The subsequent barrier then takes care of the one ordering
case (a store followed by a load) that x86 doesn't already guarantee.

At least that's what happens in theory. It seemed to be slightly
faster in practice for me too, which is why I went with it for now
(theoretically faster + weak empirical evidence that it's faster =
reasonable basis for tentative conclusion, IMO) but I'm open to
further study of the different approaches and changing if we find
something else is better.

Ultimately I think I'd like to get rid of a_store at some point anyway
since I think we can do better without it.

