Date: Sat, 15 Aug 2015 19:28:15 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Re: [PATCH] replace a mfence instruction by an xchg instruction

On Sat, Aug 15, 2015 at 11:01:40PM +0200, Jens Gustedt wrote:
> On Saturday, 2015-08-15 at 16:17 -0400, Rich Felker wrote:
> > On Sat, Aug 15, 2015 at 08:51:41AM +0200, Jens Gustedt wrote:
> > > according to the wisdom of the Internet, e.g.
> > >
> > > https://peeterjoot.wordpress.com/2009/12/04/intel-memory-ordering-fence-instructions-and-atomic-operations/
> > >
> > > an mfence instruction is about 3 times slower than an xchg instruction.
> >
> > Uhg, then why does this instruction even exist if it does less and
> > does it slower?
>
> Because they do different things ?)
>
> mfence is to synchronize all memory; xchg, at least at first glance,
> only one word.

No, any lock-prefixed instruction, or xchg, which has an implicit lock,
fully orders all memory accesses. Essentially it contains a built-in
mfence. This is why I find it odd that a lone mfence is 3x slower.

> But I also read that the relative performance of these instructions
> depends a lot on the actual die you are dealing with.

Did you measure mfence being slower here?

> > > Here we not only had mfence but also the mov instruction that was to be
> > > protected by the fence. Replace all that by a native atomic instruction
> > > that gives all the ordering guarantees that we need.
> > >
> > > This a_store function is performance critical for the __lock
> > > primitive. In my benchmarks to test my stdatomic implementation I have a
> > > substantial performance increase (more than 10%), just because malloc
> > > does better with it.
> >
> > Is there a reason you're not using the same approach as on i386? It
> > was faster than xchg for me, and in principle it "should be faster".
>
> I discovered your approach for i386 after I experimented with "xchg"
> for x86_64.
> I guess the "lock orl" instruction is a replacement for "mfence"
> because that one is not implemented for all variants of i386?
>
> Exactly why a "mov" followed by a read-modify-write operation to some
> random address (here the stack pointer) should be faster than a
> read-modify-write operation with exactly the address you want to deal
> with looks weird.

xchg on the atomic address in principle reads a cache line (the write
destination) that's known to be shared with other threads despite not
needing it, then modifies it. mov on the other hand just appends the
write to the local store buffer; reading the target cache line is not
necessary. The subsequent barrier then takes care of ordering issues
(just the one case x86 doesn't guarantee already).

At least that's what happens in theory. It seemed to be slightly faster
in practice for me too, which is why I went with it for now
(theoretically faster + weak empirical evidence that it's faster =
reasonable basis for tentative conclusion, IMO), but I'm open to further
study of the different approaches and changing if we find something else
is better.

Ultimately I think I'd like to get rid of a_store at some point anyway,
since I think we can do better without it.

Rich