
In computer science, compare-and-swap (CAS) is an atomic instruction used in multithreading to achieve synchronization. It compares the contents of a memory location with a given value and, only if they are the same, modifies the contents of that memory location to a new given value. This is done as a single atomic operation. The atomicity guarantees that the new value is calculated based on up-to-date information; if the value had been updated by another thread in the meantime, the write would fail. The result of the operation must indicate whether it performed the substitution; this can be done either with a simple boolean response (this variant is often called compare-and-set) or by returning the value read from the memory location (not the value written to it).


Overview

A compare-and-swap operation is an atomic version of the following pseudocode, where * denotes access through a pointer:

    function cas(p: pointer to int, old: int, new: int) is
        if *p ≠ old
            return false
        *p ← new
        return true

This operation is used to implement synchronization primitives like semaphores and mutexes, as well as more sophisticated lock-free and wait-free algorithms. Maurice Herlihy (1991) proved that CAS can implement more of these algorithms than atomic read, write, or fetch-and-add, and that, assuming a fairly large amount of memory, it can implement all of them. CAS is equivalent to load-link/store-conditional, in the sense that a constant number of invocations of either primitive can be used to implement the other one in a wait-free manner.

Algorithms built around CAS typically read some key memory location and remember the old value. Based on that old value, they compute some new value. Then they try to swap in the new value using CAS, where the comparison checks that the location still holds the old value. If CAS indicates that the attempt has failed, the sequence has to be repeated from the beginning: the location is re-read, a new value is re-computed, and the CAS is tried again. Instead of immediately retrying after a CAS fails, researchers have found that total system performance can be improved in multiprocessor systems, where many threads constantly update some particular shared variable, if threads that see their CAS fail use exponential backoff; in other words, they wait a little before retrying the CAS.
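The following is a minimal C11 sketch of this read-compute-CAS retry loop with a capped exponential backoff. The helpers compute_new() and cpu_relax() are hypothetical placeholders: the first stands in for whatever new value an algorithm derives from the old one, the second for a platform-specific pause or yield hint.

    #include <stdatomic.h>

    /* Hypothetical: derives the desired new value from the old one. */
    extern int compute_new(int old_value);

    /* Placeholder for a platform pause/yield hint (e.g. a pause instruction). */
    static inline void cpu_relax(void) { }

    void update(_Atomic int *location)
    {
        unsigned delay = 1;
        for (;;) {
            int old_value = atomic_load(location);
            int new_value = compute_new(old_value);
            /* The CAS succeeds only if *location still holds old_value. */
            if (atomic_compare_exchange_weak(location, &old_value, new_value))
                return;
            /* CAS failed: another thread changed *location. Back off,
             * then re-read and recompute from the beginning. */
            for (unsigned i = 0; i < delay; i++)
                cpu_relax();
            if (delay < 1024)
                delay *= 2;   /* exponential backoff, capped */
        }
    }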


Example application: atomic adder

As an example use case of compare-and-swap, here is an algorithm for atomically incrementing or decrementing an integer. This is useful in a variety of applications that use counters. The function add performs the action *p ← *p + a atomically (again denoting pointer indirection by *, as in C) and returns the final value stored in the counter. Unlike in the pseudocode above, there is no requirement that any sequence of operations is atomic except for cas.

    function add(p: pointer to int, a: int) returns int
        done ← false
        while not done
            value ← *p   // Even this operation doesn't need to be atomic.
            done ← cas(p, value, value + a)
        return value + a

In this algorithm, if the value of *p changes after (or while!) it is fetched and before the CAS does the store, CAS will notice and report this fact, causing the algorithm to retry.
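A C11 rendering of the add() pseudocode might look as follows (a sketch, assuming the counter is declared _Atomic). On failure, atomic_compare_exchange_weak stores the value it actually found back into value, so the re-read from the pseudocode happens implicitly.

    #include <stdatomic.h>

    int add(_Atomic int *p, int a)
    {
        int value = atomic_load(p);   /* this load need not be part of the CAS */
        while (!atomic_compare_exchange_weak(p, &value, value + a))
            ;   /* retry with the freshly observed value */
        return value + a;
    }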


ABA problem

Some CAS-based algorithms are affected by and must handle the problem of a false positive match, known as the ABA problem. It is possible that, between the time the old value is read and the time the CAS is attempted, some other processors or threads change the memory location two or more times such that it ends up holding a bit pattern that matches the old value. The problem arises if this new bit pattern, which looks exactly like the old value, has a different meaning: for instance, it could be a recycled address or a wrapped version counter.

A general solution to this is to use a double-length CAS (DCAS). For example, on a 32-bit system, a 64-bit CAS can be used. The second half holds a counter. The compare part of the operation compares the previously read value of the pointer and the counter with the current pointer and counter. If they match, the swap occurs and the new value is written, but the new value carries an incremented counter. This means that if ABA has occurred, although the pointer value will be the same, the counter is exceedingly unlikely to be the same (for a 32-bit value, a multiple of 2^32 operations would have to have occurred, causing the counter to wrap, and at that exact moment the pointer value would also have to be the same by chance). An alternative form of this (useful on CPUs that lack DCAS) is to use an index into a freelist rather than a full pointer, e.g. with a 32-bit CAS, use a 16-bit index and a 16-bit counter. However, the reduced counter lengths begin to make ABA possible at modern CPU speeds.

One simple technique which helps alleviate this problem is to store an ABA counter in each data structure element, rather than using a single ABA counter for the whole data structure. A more complicated but more effective solution is to implement safe memory reclamation (SMR), which is in effect lock-free garbage collection. The advantage of using SMR is the assurance that a given pointer will exist only once at any one time in the data structure; thus the ABA problem is completely solved. (Without SMR, something like a freelist will be in use to ensure that all data elements can be accessed safely, with no memory access violations, even when they are no longer present in the data structure. With SMR, only elements actually currently in the data structure will be accessed.)
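The following is a hedged sketch of the index-plus-counter scheme described above, with a 32-bit node index and a 32-bit modification counter packed into one 64-bit word so that an ordinary 64-bit CAS covers both. The helper next_of() and the names head and pop_index are illustrative, not part of any particular library.

    #include <stdatomic.h>
    #include <stdint.h>

    /* Hypothetical: maps a node index to the index of its successor. */
    extern uint32_t next_of(uint32_t index);

    static _Atomic uint64_t head;   /* layout: (counter << 32) | index */

    static uint32_t pop_index(void)   /* empty-list handling omitted */
    {
        uint64_t old_word = atomic_load(&head);
        uint64_t new_word;
        do {
            uint32_t index   = (uint32_t)old_word;
            uint32_t counter = (uint32_t)(old_word >> 32);
            /* Even if another thread pops and later pushes `index` back
             * (ABA on the index), the counter will have advanced, so the
             * CAS fails and the loop re-reads the current head. */
            new_word = ((uint64_t)(counter + 1) << 32) | next_of(index);
        } while (!atomic_compare_exchange_weak(&head, &old_word, new_word));
        return (uint32_t)old_word;
    }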


Costs and benefits

CAS, and other atomic instructions, are sometimes thought to be unnecessary in uniprocessor systems, because the atomicity of any sequence of instructions can be achieved by disabling interrupts while executing it. However, disabling interrupts has numerous downsides. For example, code that is allowed to do so must be trusted not to be malicious and not to monopolize the CPU, as well as to be correct and not to accidentally hang the machine in an infinite loop or page fault. Further, disabling interrupts is often deemed too expensive to be practical. Thus, even programs only intended to run on uniprocessor machines will benefit from atomic instructions, as in the case of Linux's futexes.

In multiprocessor systems, it is usually impossible to disable interrupts on all processors at the same time. Even if it were possible, two or more processors could be attempting to access the same semaphore's memory at the same time, and thus atomicity would not be achieved. The compare-and-swap instruction allows any processor to atomically test and modify a memory location, preventing such multiple-processor collisions.

On server-grade multi-processor architectures of the 2010s, compare-and-swap is cheap relative to a simple load that is not served from cache. A 2013 paper points out that a CAS is only 1.15 times more expensive than a non-cached load on an Intel Xeon (Westmere-EX) and 1.35 times on an AMD Opteron (Magny-Cours).


Implementations

Compare-and-swap (and compare-and-swap-double) has been an integral part of the IBM 370 (and all successor) architectures since 1970. The operating systems that run on these architectures make extensive use of this instruction to facilitate process (i.e., system and user tasks) and processor (i.e., central processors) parallelism while eliminating, to the greatest degree possible, the "disabled spinlocks" which had been employed in earlier IBM operating systems. Similarly, the use of test-and-set was also eliminated. In these operating systems, new units of work may be instantiated "globally", into the global service priority list, or "locally", into the local service priority list, by the execution of a single compare-and-swap instruction. This substantially improved the responsiveness of these operating systems.

In the x86 (since the 80486) and Itanium architectures this is implemented as the compare-and-exchange (CMPXCHG) instruction (on a multiprocessor the LOCK prefix must be used). As of 2013, most multiprocessor architectures support CAS in hardware, and the compare-and-swap operation is the most popular synchronization primitive for implementing both lock-based and non-blocking concurrent data structures. The atomic counter and atomic bitmask operations in the Linux kernel typically use a compare-and-swap instruction in their implementation. The SPARC-V8 and PA-RISC architectures are two of the very few recent architectures that do not support CAS in hardware; the Linux port to these architectures uses a spinlock.
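As an illustrative (not production) sketch, the CMPXCHG instruction mentioned above can be reached directly from C via GCC-style inline assembly on x86; real code would normally use compiler intrinsics or C11 atomics instead. CMPXCHG compares EAX with the memory operand; if they are equal it stores the source register there, otherwise it loads the memory value into EAX. The LOCK prefix makes the whole operation atomic. The name cas32 is hypothetical.

    /* Returns the previous value of *ptr; it equals `expected` iff the swap happened. */
    static inline int cas32(volatile int *ptr, int expected, int desired)
    {
        int previous;
        __asm__ __volatile__("lock; cmpxchgl %2, %1"
                             : "=a"(previous), "+m"(*ptr)
                             : "r"(desired), "0"(expected)
                             : "memory");
        return previous;
    }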


Implementation in C

Many C compilers support using compare-and-swap either with the C11 functions, with some non-standard extension of that particular C compiler, or by calling a function written directly in assembly language using the compare-and-swap instruction.

The following C function shows the basic behavior of a compare-and-swap variant that returns the old value of the specified memory location; however, this version does not provide the crucial guarantees of atomicity that a real compare-and-swap operation would:

    int compare_and_swap(int* reg, int oldval, int newval)
    {
        int old_reg_val = *reg;
        if (old_reg_val == oldval)
            *reg = newval;
        return old_reg_val;
    }

old_reg_val is always returned, but it can be tested following the compare_and_swap operation to see if it matches oldval, as it may be different, meaning that another process has managed to succeed in a competing compare_and_swap to change the reg value from oldval. For example, an election protocol can be implemented such that every process checks the result of compare_and_swap against its own PID (= newval). The winning process finds the compare_and_swap returning the initial non-PID value (e.g., zero). For the losers it will return the winning PID.

    bool compare_and_swap(int *accum, int *dest, int newval)
    {
        if (*accum == *dest) {
            *dest = newval;
            return true;
        } else {
            *accum = *dest;
            return false;
        }
    }

This is the logic in the Intel Software Manual Vol 2A.
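The election protocol described above can be sketched with the standard C11 atomics mentioned at the start of this section; the names leader and try_become_leader are hypothetical. Unlike the illustrative functions above, atomic_compare_exchange_strong is genuinely atomic, and on failure it writes the value it actually found (the winner's id) into expected.

    #include <stdatomic.h>
    #include <stdbool.h>

    static _Atomic int leader = 0;    /* 0 is the initial non-PID value */

    bool try_become_leader(int my_id)
    {
        int expected = 0;
        if (atomic_compare_exchange_strong(&leader, &expected, my_id))
            return true;              /* our id was swapped in: we won */
        /* expected now holds the winning id. */
        return false;
    }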


Extensions

Since CAS operates on a single pointer-sized memory location, while most lock-free and wait-free algorithms need to modify multiple locations, several extensions have been implemented.

Double compare-and-swap (DCAS): Compares two unrelated memory locations with two expected values, and if they're equal, sets both locations to new values. The generalization of DCAS to multiple (non-adjacent) words is called MCAS or CASN. DCAS and MCAS are of practical interest in the convenient (concurrent) implementation of some data structures like deques or binary search trees (Keir Fraser (2004), "Practical lock-freedom", UCAM-CL-TR-579). DCAS and MCAS may, however, be implemented using the more expressive hardware transactional memory present in some recent processors such as IBM POWER8 or in Intel processors supporting Transactional Synchronization Extensions (TSX).

Double-wide compare-and-swap: Operates on two adjacent pointer-sized locations (or, equivalently, one location twice as big as a pointer); see the sketch after this list. On later x86 processors, the CMPXCHG8B and CMPXCHG16B instructions serve this role, although early 64-bit AMD CPUs did not support CMPXCHG16B (modern AMD CPUs do). Some Intel motherboards from the Core 2 era also hamper its use, even though the processors support it. These issues came into the spotlight at the launch of Windows 8.1, because it required hardware support for CMPXCHG16B.

Single compare, double swap: Compares one pointer but writes two. The Itanium's cmp8xchg16 instruction implements this, where the two written pointers are adjacent.

Multi-word compare-and-swap: A generalisation of normal compare-and-swap. It can be used to atomically swap an arbitrary number of arbitrarily located memory locations. Usually, multi-word compare-and-swap is implemented in software using normal double-wide compare-and-swap operations. The drawback of this approach is a lack of scalability.
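The following is a hedged sketch of double-wide compare-and-swap expressed through C11 atomics: a pointer and a version tag stored in adjacent words of one atomic struct. Whether the compiler emits a native instruction such as CMPXCHG16B or falls back to a lock depends on the target and build flags (for example -mcx16 and linking libatomic with GCC or Clang); the names tagged_ptr, top, and publish are illustrative.

    #include <stdatomic.h>
    #include <stdint.h>

    typedef struct {
        void     *ptr;
        uintptr_t tag;    /* incremented on every update to defeat ABA */
    } tagged_ptr;

    static _Atomic tagged_ptr top;

    static void publish(void *new_ptr)
    {
        tagged_ptr expected = atomic_load(&top);
        tagged_ptr desired;
        do {
            desired.ptr = new_ptr;
            desired.tag = expected.tag + 1;
        } while (!atomic_compare_exchange_weak(&top, &expected, desired));
    }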


See also


References


External links


Basic algorithms implemented using CAS

* 2003 discussion "Lock-Free using cmpxchg8b..." on Intel x86, with pointers to various papers and source code


Implementations of CAS



* Java: the java.util.concurrent.atomic package implements compareAndSet in various classes
* .NET: class method Interlocked::CompareExchange
* Windows API: InterlockedCompareExchange