#+TITLE: An implementation of the C11 =<stdatomic.h>= interface
#+AUTHOR: Jens Gustedt
#+DATE: July, 2015
#+LATEX_HEADER: \usepackage{color}
#+LATEX_HEADER: \usepackage{listings}
#+LATEX_HEADER: \lstset{
#+LATEX_HEADER: keywordstyle=\color{blue},
#+LATEX_HEADER: commentstyle=\color{red},
#+LATEX_HEADER: stringstyle=\color{green},
#+LATEX_HEADER: basicstyle=\ttfamily\small,
#+LATEX_HEADER: columns=fullflexible,
#+LATEX_HEADER: frame=single,
#+LATEX_HEADER: basewidth={0.4em,0.4em},
#+LATEX_HEADER: }

* Short description

The implementation of the C11 atomic interface typically sits between the implementation of the core language by the C compiler and the implementation of the C library. It needs compiler support for the individual atomic operations and library support for the cases where no low-level atomic instruction is available and a lock must be taken.

- This implementation builds entirely on the two gcc ABIs for atomics. It doesn't even attempt to go down to assembly level by itself.
- We provide all function interfaces that the two gcc ABIs and the C standard need.
- For compilers that don't offer direct language support for atomics, this provides a reduced but fully functional approach to atomic operations.

* Implemented library features

We distinguish between the implementation of the library functions and the =<stdatomic.h>= header file. The latter already contains a lot of intelligence, because it has to provide the type generic interfaces; this is more involved than for usual C library header files. The header file is optional; the function interfaces that are created should be compatible with the header files that gcc and clang may provide.

** Types, constants and function interfaces

These are the types and proper functions that are foreseen by the standard:

- =atomic_flag= and its four functions
- the =memory_order= enumeration type
- fences
- object-like macros to test for lock-freeness and similar things
- =typedef=s for atomic integer and pointer types.

All of these are provided in forms that are compatible with gcc and clang.

** Type generic functions

These are all implemented as macros and should in many cases result in optimized, inlined assembler instructions, not in library calls. Library calls are only needed as a fallback, when there is no reasonable instruction set available.

This implementation uses predefined macros of the form =__GCC_HAVE_SYNC_COMPARE_AND_SWAP_X=, where =X= can be one of 1, 2, 4, 8 or 16. All versions of gcc and clang from the last ten years or so implement these macros and the underlying operations consistently. If such a macro exists, we suppose that the compiler is able to synthesize all corresponding memory functions for types of size =X= and all necessary arithmetic functions for integer types of that size as lock-free (that is, stateless) instructions.

This doesn't mean that there are direct assembler instructions for all these operations. They may well be implemented as an unbounded loop that uses a =compare_and_swap= (CAS) primitive for the atomic exchange. Gcc typically does this for the less common atomic arithmetic operations such as =atomic_fetch_and=, for example. Lock-free doesn't mean a bounded number of instructions.
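To illustrate, the following sketch shows roughly how an =atomic_fetch_and= on a 4 byte integer can be synthesized as a CAS loop; the function name is made up for the example and is not part of any ABI.

#+begin_src C
#include <stdatomic.h>
#include <stdint.h>

/* Sketch only: roughly how a fetch-and-and may be synthesized from a CAS
   primitive when there is no dedicated instruction.  The loop is unbounded,
   yet the operation is lock-free in the sense used here. */
uint32_t example_fetch_and_4(_Atomic(uint32_t)* object, uint32_t operand) {
  uint32_t expected = atomic_load(object);
  /* Retry until no other thread has modified the object in between;
     on failure, expected is refreshed with the current value. */
  while (!atomic_compare_exchange_weak(object, &expected, expected & operand)) {
    /* nothing to do, expected already holds the refreshed value */
  }
  return expected;
}
#+end_src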
For the operations that cannot be mapped to assembler instructions the compiler inserts calls to external functions. The names of these functions are typically composed of the operation, prefixed either by =__sync_= (the older gcc ABI) or =__atomic_= (the newer gcc ABI). The names of these calls can be suffixed by =_X=, for =X= as above, if they concern an operation on a type of the corresponding width. All external functions that the gcc ABIs require are provided.

*** The =__atomic= ABI

This ABI is already close to the C11 call interface. Relevant for C11 are 9 operations:

- =fetch_add= for integer addition, returning the previous value
- =fetch_sub= for integer subtraction, returning the previous value
- =fetch_or= for bitwise or, returning the previous value
- =fetch_and= for bitwise and, returning the previous value
- =fetch_xor= for bitwise xor, returning the previous value
- =load= for an atomic load operation
- =store= for an atomic store operation
- =exchange= for an atomic exchange operation, equivalent to a =store= that returns the previous value
- =compare_exchange= for an atomic compare and exchange operation, equivalent to a conditional =store= that also saves the previous value and returns =false= or =true= according to the success of the condition.

In addition to the more or less obvious operands, the built-in functions take one or two additional parameters that specify the =memory_order= of the operation. So the built-ins correspond to the C11 "explicit" interfaces such as =atomic_fetch_add_explicit=.

Observe that the built-in functions foresee only one interface, =compare_exchange=:

- The distinction between the "weak" and "strong" versions of this built-in is controlled through an additional parameter, not through a different function interface.
- The function symbol fallback =__atomic_compare_exchange= confusingly has different semantics and a different prototype than the built-in function. It lacks the parameter to choose between the "weak" and the "strong" version and corresponds solely to the C11 operation =atomic_compare_exchange_strong_explicit=.

Load, store and compare operations have /memory/ semantics, that is, they are equivalent to the =memcpy= and =memcmp= library functions. The implementation may use direct assignment or comparison operators in some places for optimization, but it then does so on objects of =uintXX_t= type, so every bit is accounted for. For data types where memory and value comparison differ, the result of an =atomic_compare_exchange= operation can therefore be different from what you'd expect:

- =_Bool= objects in which bits other than the lowest-order bit have been polluted will not compare equal to =false= or =true=.
- For floating point types, different representations of =0= may compare unequal.
- Two floating point =NaN= values may compare equal, although by value comparison a =NaN= never compares equal to anything.
- Objects of =struct= or =union= type may be considered unequal because they differ in some padding bytes.

This behavior is in line with the intended interpretation by the C and C++ standards committees.
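The following small example illustrates the memory semantics for a floating point type; it assumes IEEE style floating point, where =0.0= and =-0.0= have distinct object representations.

#+begin_src C
#include <stdatomic.h>
#include <stdio.h>

int main(void) {
  _Atomic(double) object = ATOMIC_VAR_INIT(-0.0);
  double expected = 0.0;
  /* By value 0.0 == -0.0, but the two differ in their object representation
     (the sign bit), so a memory based compare and exchange is expected to
     fail and to copy the current representation back into expected. */
  if (!atomic_compare_exchange_strong(&object, &expected, 1.0))
    printf("exchange failed, expected is now %g\n", expected);
  return 0;
}
#+end_src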
Function call interfaces for the arithmetic operations are only generated if we can suppose that an integer type of the corresponding size exists. We can reasonably assume that there are always types =uint8_t=, =uint16_t=, =uint32_t= and =uint64_t=, so the variants for 1, 2, 4 and 8 can always be generated. For a 128 bit type, these are only generated if =__SIZEOF_INT128__= or =__GCC_HAVE_SYNC_COMPARE_AND_SWAP_16= exists. If so, we assume that =__uint128_t= is such an integer type and known to the compiler.

Arithmetic operations can safely use these =uintXX_t= types internally, since the standard imposes two's complement representation for signed atomic types and also enforces that atomic operations may not produce traps on overflow.

In addition to the operations that have generic function interfaces in the C11 standard, gcc implements seven further built-ins, namely

- =__atomic_add_fetch= for integer addition, returning the updated value
- =__atomic_sub_fetch= for integer subtraction, returning the updated value
- =__atomic_or_fetch= for bitwise or, returning the updated value
- =__atomic_and_fetch= for bitwise and, returning the updated value
- =__atomic_xor_fetch= for bitwise xor, returning the updated value
- =__atomic_fetch_nand= for bitwise nand (=x = ~(x & v)=), returning the previous value
- =__atomic_nand_fetch= for bitwise nand (=x = ~(x & v)=), returning the updated value

For the completeness of the library interface we supply analogous functions with the =_X= suffix for these. They may be called by the compiler if the user code uses compound assignment operators such as =+== on atomic integers. The =__atomic_add_fetch= and =__atomic_sub_fetch= functions may also be used by the compiler to implement an atomic prefix increment or decrement operation (=++x= and =--x=). This would happen, e.g., if =x= is an object of type =__int128_t= and the platform doesn't implement lock-free atomics for types of size 16.

*** Clang's =__c11_atomic= built-ins

Clang has gone a different path for the built-ins that implement C11 atomics, prefixing them with =__c11_atomic=. These are directly feature-equivalent to the C11 generic functions that take =memory_order= arguments (the =_explicit= variants). For the cases where no atomic instructions can be synthesized, clang falls back to the same external calls as described for gcc's =__atomic= ABI.

*** The =__sync= ABI

This ABI dates back to long before the C11 atomic interface was designed and thus cannot conform to it directly. It has basically the same built-ins for arithmetic types as above, only that

- The functions are named a bit differently.
- They only implement sequential consistency.
- There are no =load=, =store= or =exchange= features.
- The =nand= operations changed their meaning with gcc version 4.4. Therefore this operation cannot be used portably in an environment that may mix different compiler versions, so we don't implement these function interfaces and deprecate the use of this built-in.

This interface additionally implements a =test_and_set= functionality that is used to implement the =atomic_flag= functions. This built-in is documented to have acquire-release consistency; if sequential consistency is requested, an additional fence is inserted to ensure it.

These features are sufficient to provide a decent implementation of C11 atomics.

*** The lock-full fallback functions

In the absence of proper architecture support, all fallbacks (for the three built-in families) with the =_X= suffix use the ones without suffix underneath. These external interfaces receive the size of the data type as an additional, leading parameter:

- =__atomic_load=
- =__atomic_store=
- =__atomic_exchange=
- =__atomic_compare_exchange=

They have pure memory semantics, and their basic operations are =memcpy= and =memcmp= for load, store and comparison. These functions *cannot be called directly* from within your code, because the compiler cannot distinguish them from the gcc built-ins, /and/ they have different prototypes than these.

We implement these functions as critical sections that are protected by a lock, similar to a mutex. The implementation uses a table of locks and a hash function that chooses one of the entries depending only on the address of the atomic object.
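The following is a minimal sketch of what such a size-prefixed, lock-full fallback can look like; the lock table, the hash function, the names and the exact prototype are placeholders for this illustration, not the functions that the library actually defines.

#+begin_src C
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical lock table: one spinlock per bucket.  Objects with static
   storage duration are zero initialized, which leaves the flags clear. */
#define EXAMPLE_NLOCKS 64
static atomic_flag example_locks[EXAMPLE_NLOCKS];

static unsigned example_hash(void const volatile* address) {
  /* Any function that mixes the bits of the address will do here. */
  uintptr_t a = (uintptr_t)address;
  return (unsigned)((a >> 4) ^ (a >> 9)) % EXAMPLE_NLOCKS;
}

/* Sketch of a generic exchange with pure memory semantics. */
void example_atomic_exchange(size_t size, volatile void* object,
                             void const* value, void* previous) {
  atomic_flag* lock = &example_locks[example_hash(object)];
  while (atomic_flag_test_and_set_explicit(lock, memory_order_acquire))
    ;                                      /* active wait */
  memcpy(previous, (void*)object, size);   /* read the old contents  */
  memcpy((void*)object, value, size);      /* write the new contents */
  atomic_flag_clear_explicit(lock, memory_order_release);
}
#+end_src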
At the moment, the implementation has several address-hash functions from which one can be chosen at library compile time. Any function that mixes the bits of the address should perform reasonably well.

More important for performance is the choice of the lock. Such a lock can be relatively simple, since C11 atomics that are not lock-free don't have to be asynchronous signal safe. There are several possibilities, in order of preference:

- An OS specific lightweight lock with non-active waits. The integration into =musl= uses the internal =LOCK/UNLOCK= calls, which in turn use the Linux =futex= underneath to wait efficiently. If by coincidence these are called in an un-threaded process, they are close to no-ops.
- C11's =mtx_t= type has a shallow interface that should allow it to be implemented somewhat simpler and more efficiently than OS specific mutexes, which implement a lot of functionality. This solution should be portable to all platforms that implement this part of C11; in the relatively near future these could be all POSIX and Windows platforms. The disadvantage of this approach is that a table of =mtx_t= must be initialized at process startup, because =mtx_t= doesn't guarantee static initialization.
- POSIX' =pthread_mutex_t= is a little less portable, but allows for static initialization.
- A spinlock such as =atomic_flag=. It is portable to all platforms that implement atomics and allows for static initialization. This is the only choice when compiling without OS or library support. The wait is an active wait that burns CPU cycles and memory bandwidth. In many circumstances this should do well, since the critical sections that are protected by it are nice and small. But performance can degrade noticeably if a lot of threads repeatedly hammer on the same atomic object.

* The =<stdatomic.h>= header file

** Full C11 support

Versions of gcc and clang that fully implement the C11 atomics interface do not need a special header file but can use their own, which is shipped with the compiler:

- gcc starting with version 4.9
- clang starting with version 3.6

This full support of atomics allows atomic objects to be used just like other objects, with whatever operations the base type supports. These default operations on atomics use sequential consistency. That is, each such operation enforces a full memory transfer, and the perceived effect is as if all these operations, even if issued in different threads, had been done one after another. Thus, thread parallelism can only unfold between such operations:

#+BEGIN_CENTER
*atomic operations are expensive*
#+END_CENTER

The functional interfaces with explicit =memory_order= arguments (=_explicit= suffix to the name) that we described above may be used to mitigate the memory effects that atomic operations have. The possible gain from such relaxed memory consistency models is very architecture dependent. On x86 platforms, e.g., they offer almost no advantage, whereas on ARM platforms acquire/release semantics may bring a noticeable gain. But beware that this gain is bought with a considerable complication of the code. Only use this if the atomic operations are a measurable performance bottleneck /and/ you have already reduced the number of these operations to a minimum.
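As a small usage sketch of the =_explicit= interfaces (not code from this library), an event counter that is only ever read after all worker threads have been joined can use relaxed ordering:

#+begin_src C
#include <stdatomic.h>
#include <stddef.h>

static _Atomic(size_t) event_count = ATOMIC_VAR_INIT(0);

void count_event(void) {
  /* No ordering with respect to other memory operations is needed here,
     so relaxed consistency is sufficient and potentially cheaper than the
     sequentially consistent default. */
  atomic_fetch_add_explicit(&event_count, 1, memory_order_relaxed);
}

size_t report_events(void) {
  /* Only called after the counting threads have been joined. */
  return atomic_load_explicit(&event_count, memory_order_relaxed);
}
#+end_src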
** Partial C11 atomics support

A series of compiler versions offers partial atomics support that already implements most of the C11 semantics:

- gcc versions 4.7 and 4.8
- clang versions 3.2 to 3.5

These versions provide the built-in functions as described above, but lack full compiler support for atomic types and operations. With the =<stdatomic.h>= header that we supply for these compilers, application code can use the functional interfaces. A macro =_Atomic(T)= is provided that can be used to issue emulated declarations of atomic types that should be *forward compatible* with platforms that have complete C11 atomics support. Example:

#+begin_src C
#include <stdatomic.h>
#include <stddef.h>

// global variables
_Atomic(size_t) thread_inside_count = ATOMIC_VAR_INIT(0);
_Atomic(size_t) thread_total_count = ATOMIC_VAR_INIT(1);

int my_thread_function(void* arg) {
  atomic_fetch_add(&thread_inside_count, 1);
  atomic_fetch_add(&thread_total_count, 1);
  // do something complicated here
  // at the end
  atomic_fetch_sub(&thread_inside_count, 1);
  return 0;
}
#+end_src

Underneath, such emulated atomic objects are implemented as arrays of size 1 of the =volatile=-qualified base type. This has the following intended effects:

- They can't be assigned to.
- They evaluate to a pointer in almost any context.
- Operations on them cannot be reordered by the compiler.

So you should be relatively safe from programming errors that would access such objects without passing through the type generic atomic functions. The compiler will error out on improper usage of such atomic objects, but the diagnostics may be a bit crude.

*** Issues

Since this approach may reinterpret data through pointer casts, it could potentially be dangerous, so let us discuss the possible issues.

- The generic fallbacks for memory access only use =memcpy= and =memcmp= to access the data itself. So the access to the data is within the constraints of the standard.
- The generic fallbacks for memory access ensure that their arguments have compatible base types (if a pointer is passed in) or are assignment compatible with the base type of the atomic (if a value is passed in). So data that is copied across can never be misinterpreted as being of the wrong type, because the two target types are compatible.
- The specialized functions with the =_X= suffix may reinterpret their data as the =uintXX_t= of the corresponding size. Copying or comparing such data is always guaranteed to use all bits, so in that sense it is equivalent to =memcpy= and =memcmp=.
- The arithmetic operations are then executed on an unsigned integer type that has no padding bits. This arithmetic is compatible with all integer types that have no padding bits and, for signed types, use two's complement representation.
- An emulated atomic is implemented as an array of the base type, so in the user code the base type of the object remains visible to the compiler. As a consequence, this approach has no effect on the aliasing rules; the compiler always has complete information about the type of each object.
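To make this concrete, the following sketch shows the technique with a hand-written type; it is not the =_Atomic(T)= macro from the header, only an illustration of the effects listed above.

#+begin_src C
#include <stddef.h>

/* Sketch only: an emulated atomic as a one-element array of the
   volatile qualified base type. */
typedef volatile size_t emulated_atomic_size_t[1];

emulated_atomic_size_t counter;   /* static objects start out zeroed */

void example(void) {
  /* The array decays to `volatile size_t *`, so its address can be handed
     to the built-ins (or to the fallback functions), much as the type
     generic macros of the emulation header would do: */
  __atomic_fetch_add(&counter[0], 1, __ATOMIC_SEQ_CST);

  /* Assignment to the object as a whole, such as `counter = 0;`, is a
     constraint violation and is rejected by the compiler. */
}
#+end_src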
The only potential problem that remains for our approach is alignment. Since the stub functions that are provided may cast "atomic" objects to =uintXX_t=, you have to ensure that these objects are at least aligned as those types would be. This should not be a problem if the base type is an integer type, too; integer types of the same size should have the same alignment. If you encounter problems with a user defined type whose size is a small power of two, you can force alignment

#+begin_src C
_Alignas(sizeof(toto)) _Atomic(toto) toto1;
__attribute__((__aligned__(sizeof(toto)))) _Atomic(toto) toto2;
#+end_src

with whichever of the two constructs works for you. I am currently struggling to provide a version of the =_Atomic(T)= macro that ensures this automatically. It seems to be possible, but produces a lot of noise for function parameters that are pointers to atomics.

** Basic atomics support

Even older versions of gcc and clang implement the =__sync= built-in functions and can thereby be made to accept the same header as discussed above. Since, as their names indicate, these built-ins only have fully synchronizing versions, they will not be able to take advantage of the weaker consistency models. But implementing atomics with stronger consistency than required, here always sequential consistency, still conforms to the C standard.

* The implementation

** Requirements

*** Compilers

You should be able to compile this implementation with any modern version of gcc or clang. (Exact versions are hard to tell; gcc should work from 4.1 onward.) The quality of the resulting binary will depend on the atomics support implemented by the compiler.

There are three different implementations: one for modern clang, one for modern gcc, and one for those compilers that only support the =__sync_= built-ins. They are only tested with clang and gcc, but might work with other compilers that implement one of these sets of built-ins and are otherwise compatible with some gcc extensions:

- statement expressions with =({ })=
- =__attribute__= with =__alias__= and =__unused__=
- =__builtin_choose_expr= for the =__sync= version, as a precursor of C11's =_Generic=

There are some heuristics in place to decide at compile time which case applies, namely =__clang__= to detect clang and the =__ATOMIC_...= macros to detect the C11 versions of the built-ins.

*** OS or C library support

The library may work with different lock constructs. The musl integration uses the internal =LOCK= and =UNLOCK= macros.

** Caveats

*** Symbol renaming

There is one important difficulty when compiling this. The original =__atomic= library interface was developed with C++ in mind, not C. Therefore it freely uses function overloading for the built-ins versus the library interface. Since we also use the library functions as fallbacks in the implementation of some of the =_X= variants, this naming scheme is not supportable with a C compiler. We get away with it by using internal names, prefixed with =__impl_=, for all functions, and then renaming the symbols to the intended ones using =objcopy=.

- The current integration into musl does this with a *script* that you have to run *manually* after compilation.
- You then have to launch make a *second time* to do the final link.

This technique is certainly not ideal and subject to improvement.

*** Support of 16 byte atomic instructions

The main difference between modern processors that is relevant here is whether they support 16 byte atomic instructions or not. There is no difficulty in detecting this at compile time, but if the library is used with code that is compiled by a different compiler, or just with different compiler options, incompatible binary code may be produced.

My plan is to freeze that feature at compile time of the library and to reflect the capability in the =<stdatomic.h>= that is provided. This may result in code that is a bit less optimized than it could be, but that is compatible.

- If the library is *not* compiled with direct 16 byte support, the application must not use it either, and thus will use the lock-full memory implementation for such operations.
- If the library *is* compiled with direct 16 byte support but the application compiler doesn't support it, the user code will fall back to library calls, which in turn use the atomic instructions. Such a variant has call overhead and cannot inline the atomics in the user binary.

All of this is not yet done, though. Be careful when using this preliminary version.
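For illustration, a compile time check for direct 16 byte support could look roughly as follows; this is a sketch of the kind of test meant above, not the configuration logic the library actually ships.

#+begin_src C
/* Sketch: decide at library compile time whether 16 byte operations can be
   synthesized lock-free.  The two macros are the gcc/clang predefined ones
   mentioned earlier; how the result would be frozen into the installed
   <stdatomic.h> is not shown. */
#if defined(__SIZEOF_INT128__) && defined(__GCC_HAVE_SYNC_COMPARE_AND_SWAP_16)
/* __uint128_t exists and 16 byte CAS can be synthesized, so the _16
   variants can be implemented lock-free. */
typedef __uint128_t example_dword;
#else
/* Otherwise, 16 byte objects go through the lock-full generic calls. */
typedef struct { unsigned char bytes[16]; } example_dword;
#endif
#+end_src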
** Leftovers

There are some leftovers that will hopefully disappear. In particular, there are several hash functions and an instrumentation infrastructure for the hashes. I don't yet have enough test cases to see what would work best here.