Date: Mon, 03 Aug 2015 13:47:49 +0200
From: Jens Gustedt <jens.gustedt@...ia.fr>
To: musl@...ts.openwall.com
Subject: stdatomic library, v2

Hello,
here is the second version of my stdatomic implementation.

The main difference from the previous one is a handcrafted lock. This
lock is there for one very particular situation, namely critical
sections that are extremely short. It is not yet clear whether the
same strategy could or should be applied to other parts of musl,
maybe as a replacement for the __lock/__unlock pair.

The situation here is special because every cycle counts:

 - On the fast path, the shorter it gets, the fewer conflicts there are
   with other lockers.

 - On the mild-conflict path, we spin only once on average, so here
   too, fewer cycles mean less interaction with other lockers.

This lock is already implemented with stdatomic semantics. First,
because it is nicer and we finally have it here :)

More seriously, the assembler produced is much better than with the
inline assembler macros, simply because the compiler can integrate
everything more smoothly and take advantage of particular properties
of the instructions. E.g. on x86 the cmpxchg instruction already places
the result of the operation in a flag, so there is no need to retest the
condition. And since every cycle counts (see above), using stdatomic
alone seems to have benefits over __lock/__unlock.

The gain from this new lock is that on my machine my benchmark (list
element allocation-insertion or removal-deallocation) goes from about
700000 to about 1000000 operations per second.

I also tried to benchmark the individual parts of the new algorithm to
get a handle on where the problem spots are, but exact measurement
doesn't seem possible due to Heisenberg effects.

As before, please find the "implementation" part of my notes below; I
also attach the full .org file.

Jens


* The implementation


** Requirements

*** Compilers

  You should be able to compile this implementation with any modern
  version of gcc or clang. (Exact version requirements are hard to
  pin down; gcc should work from 4.1 on.) The quality of the
  resulting binary will depend on the compiler's implementation of
  atomic support.

  There are three different implementations: one for modern clang, one
  for modern gcc, and one for those compilers that only support the
  =__sync_= built-ins. They have only been tested with clang and gcc,
  but might work with other compilers that implement one of these sets
  of built-ins and are otherwise compatible with some gcc extensions:

  - compound expressions with =({ })=
  - =__attribute__= with =__alias__= and =__unused__=
  - =__builtin_choose_expr= for the =__sync= version as a precursor of
    C11's =_Generic=

  There are some heuristics in place to decide at compile time which
  case applies, namely =__clang__= to detect clang and the
  =__ATOMIC_...= macros to detect the C11 versions of the built-ins.
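
  For illustration, a compile-time dispatch along these lines might
  look as follows (the =USE_..._VARIANT= macro names are made up for
  this sketch; they are not the ones used in the actual code):

  #+BEGIN_SRC C
  /* Illustrative dispatch; the USE_..._VARIANT names are made up.   */
  #if defined(__clang__)
  # define USE_CLANG_VARIANT   1  /* clang's __c11_atomic_... built-ins     */
  #elif defined(__ATOMIC_RELAXED)
  # define USE_GCC_C11_VARIANT 1  /* gcc's C11-style __atomic_... built-ins */
  #else
  # define USE_SYNC_VARIANT    1  /* fall back to the __sync_... built-ins  */
  #endif
  #+END_SRC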

*** OS or C library support

    The library may work with different lock constructs. Currently we
    implement one simple generic approach that only uses spinning, and
    a mixed approach that uses Linux' =futex= as an inactive sleep
    strategy of last resort. The latter has been tested with the
    =musl= C library.
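
    As a point of reference, a minimal sketch of the two futex
    operations that such a mixed strategy needs on Linux could look
    like this (the helper names are placeholders; musl has its own
    internal syscall wrappers):

    #+BEGIN_SRC C
    /* Minimal futex helpers; the names are placeholders, musl uses
       its own internal wrappers. */
    #define _GNU_SOURCE
    #include <linux/futex.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* Sleep in the kernel until *addr changes away from val (or we
       are woken). */
    static void __futex_wait(volatile unsigned *addr, unsigned val) {
      syscall(SYS_futex, addr, FUTEX_WAIT, val, (void*)0, (void*)0, 0);
    }

    /* Wake at most one thread that sleeps on addr. */
    static void __futex_wake(volatile unsigned *addr) {
      syscall(SYS_futex, addr, FUTEX_WAKE, 1, (void*)0, (void*)0, 0);
    }
    #+END_SRC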

    This locking strategy can be a performance bottleneck for
    applications with strong congestion on one particular piece of
    atomic data, e.g. code that inserts list elements through a
    centralized list head. If this list head cannot be realized with
    a lock-free atomic, the critical section that modifies it is
    protected by our lock. Such code has very particular properties.

    - Since the critical section usually is really short compared to a
      scheduling interval of the OS, the probability that the lock can
      be taken immediately is high. So the fast path for taking the
      lock must be *really fast*. Our implementation essentially has
      an =atomic_compare_exchange_strong_explicit=, here. One memory
      instruction on the fast path must be enough.

    - If locking fails at the first try, the probability is still very
      high that it will succeed soon after. This is because only
      scheduled threads compete here, so there are never more threads
      in play than we have processors. Therefore, as a second strategy,
      we spin for a while until we get the lock. In our experiments,
      on average a single round of spinning was enough.

    - A third, exceptional case may occur when the thread that is
      holding the lock is descheduled in the middle of the critical
      section. That event is quite rare (0.1 % in our experiments),
      but it does occur. If it does, the world changes drastically: a
      herd of threads all have to wait for a long time (until the
      locker is rescheduled) to have any chance to obtain the lock.
      Active waiting here is counterproductive. On the contrary, by
      going into an inactive OS sleep, the chance for the locker to
      regain an execution slot increases.

   We implement this strategy a bit differently from what classical
   locks with wait counters would do. We just have a single =unsigned=
   value that holds the lock bit (HO bit) and a counter at the same
   time. That counter is not viewed as a counter of the threads that
   are in a kernel wait, but just counts the number of threads inside
   the critical section (a condensed sketch follows the list below).
   This has the following advantages:

   - An update to the counter part is relatively rare. So we save
     memory bandwidth, and we also avoid too much interaction between
     the different threads that compete for the lock.

   - The fast path occurs when the value is =0=, initially. It sets
     the HO bit (the lock bit) and the LO bit (for a counter of value
     =1=) in one go. The resulting value is =UINT_MAX/2u+2u=.

   - If the fast path fails, the counter is atomically incremented by
     one, and we enter a spin lock to set the HO bit as well.

   - After having spun for some time, we suppose that we are in the
     bad situation and go into a =futex_wait=. Going into the
     =futex_wait= may fail if the value changes. Since additional
     threads only change the counter when they arrive, this can't
     happen too often and the thread eventually goes to sleep.

   - Unlocking is a very simple operation. The locker has contributed
     =UINT_MAX/2u+2u= to the value, and so just has to decrement the
     value atomically by that amount. By doing so, the thread also
     notices if other threads are still in the critical section and
     wakes one of them.
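
   Taken together, a condensed sketch of this lock, written with
   =<stdatomic.h>=, could look as follows. This is only an
   illustration: the names (=__impl_lock=, =LOCKBIT=, =CONTRIB=), the
   spin bound and the futex helper declarations are mine, and the
   real code in the attached patch differs in details such as spin
   counts and instrumentation.

   #+BEGIN_SRC C
   #include <stdatomic.h>
   #include <limits.h>

   /* Placeholders for futex helpers such as the ones sketched above. */
   extern void __futex_wait(volatile unsigned*, unsigned);
   extern void __futex_wake(volatile unsigned*);

   #define LOCKBIT (UINT_MAX/2u+1u)  /* HO bit: the lock is held         */
   #define CONTRIB (UINT_MAX/2u+2u)  /* HO bit plus a counter value of 1 */

   void __impl_lock(_Atomic(unsigned)* loc) {
     unsigned cur = 0;
     /* Fast path: value is 0, set lock bit and counter in one CAS. */
     if (atomic_compare_exchange_strong_explicit(loc, &cur, CONTRIB,
             memory_order_acq_rel, memory_order_consume))
       return;
     /* Announce ourselves once by incrementing the counter part. */
     cur = 1u + atomic_fetch_add_explicit(loc, 1u, memory_order_relaxed);
     for (;;) {
       /* Spin a bit, trying to set the HO bit; the bound is arbitrary. */
       for (unsigned i = 0; i < 16; ++i) {
         if (cur & LOCKBIT)
           cur = atomic_load_explicit(loc, memory_order_relaxed);
         else if (atomic_compare_exchange_weak_explicit(loc, &cur, cur|LOCKBIT,
                      memory_order_acq_rel, memory_order_consume))
           return;
       }
       /* Bad case: the holder was probably descheduled; sleep in the
          kernel until the value changes. */
       __futex_wait((volatile unsigned*)loc, cur);
       cur = atomic_load_explicit(loc, memory_order_consume);
     }
   }

   void __impl_unlock(_Atomic(unsigned)* loc) {
     /* Give back the lock bit and our count in one atomic subtraction;
        if others are still inside, wake one of them. */
     if (atomic_fetch_sub_explicit(loc, CONTRIB, memory_order_release) != CONTRIB)
       __futex_wake((volatile unsigned*)loc);
   }
   #+END_SRC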


** Caveats

*** Symbol renaming

  There is one important difficulty when compiling this. The original
  =__atomic= library interface was developed with C++ in mind, not
  C. Therefore it freely uses function overloading for the built-ins
  versus the library interface. Since we also use the library
  functions as fallbacks in the implementation of some of the =_X=
  variants, this naming scheme is not supportable with a C compiler.

  We get away with it by using internal names, prefixed with =__impl_=,
  for all functions. Then we rename the symbols to the intended ones
  using =objcopy=.

  - The current integration into musl does this with a *script* that
    you have to run *manually* after compilation.
  - You then have to launch =make= a *second time* to do the final link.

 This technique is certainly not ideal and subject to improvement.
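
  For illustration, the renaming step might boil down to =objcopy=
  invocations of the following kind (the two symbols and the object
  file name are only examples; the actual script rewrites the whole
  set of names):

  #+BEGIN_SRC sh
  # Rename internal symbols to the names the C11 interface expects;
  # the object file name is a placeholder.
  objcopy --redefine-sym __impl_atomic_load=atomic_load \
          --redefine-sym __impl_atomic_store=atomic_store \
          stdatomic.o
  #+END_SRC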

*** Support of 16 byte atomic instructions

    The main difference between modern processors that is relevant
    here is whether or not they support 16 byte atomic instructions.
    It is not difficult to detect this at compile time, but if the
    library is used with code that is compiled with a different
    compiler or just different compiler options, incompatible binary
    code may be produced.

    My plan is to freeze that feature at compile time of the library
    and to reflect the capacity in the =<stdatomic.h>= that is
    provided. This may then result in code that is a bit less
    optimized than it could be, but that is compatible.

    - If the library is *not* compiled with direct 16 byte support,
      the application may not use it either, and thus uses a memory
      implementation for such operations.

    - If the library *is* compiled with direct 16 byte support but the
      application compiler doesn't support it, the user code should
      fall back to library calls, which in turn use the atomic
      instructions. Such a variant has a call overhead and cannot
      inline the atomics in the user binary.

    All of this is not yet done, though. Be careful when using this
    preliminary version.
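
    A conceivable way to freeze this at library build time (the
    =HAVE_ATOMIC_16= name is made up for this sketch;
    =__GCC_HAVE_SYNC_COMPARE_AND_SWAP_16= is a macro that gcc
    predefines when the target supports a 16 byte compare-and-swap,
    e.g. with =-mcx16= on x86_64):

    #+BEGIN_SRC C
    /* Sketch only: HAVE_ATOMIC_16 is a made-up name. */
    #ifdef __GCC_HAVE_SYNC_COMPARE_AND_SWAP_16
    # define HAVE_ATOMIC_16 1  /* header advertises lock-free 16 byte ops */
    #else
    # define HAVE_ATOMIC_16 0  /* 16 byte ops go through the lock/library */
    #endif
    #+END_SRC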


** Leftovers

   There are some leftovers that will hopefully disappear.

   - There are several hash functions and an instrumentation
     infrastructure for the hashes. I didn't yet have enough test
     cases to see what would be best here.

   - There is optional instrumentation for the lock
     functions. Switching it on changes the overall performance
     substantially, and thus I'd expect a noticeable Heisenberg
     effect. So these counters can give qualitative information about
     what happens, but you shouldn't take the figures literally.



-- 
:: INRIA Nancy Grand Est ::: Camus ::::::: ICube/ICPS :::
:: ::::::::::::::: office Strasbourg : +33 368854536   ::
:: :::::::::::::::::::::: gsm France : +33 651400183   ::
:: ::::::::::::::: gsm international : +49 15737185122 ::
:: http://icube-icps.unistra.fr/index.php/Jens_Gustedt ::



