Non-uniform memory access (NUMA) is a computer memory design used in multiprocessing, where the memory access time depends on the memory location relative to the processor. Under NUMA, a processor can access its own local memory faster than non-local memory (memory local to another processor or memory shared between processors). The benefits of NUMA are limited to particular workloads, notably on servers where the data is often associated strongly with certain tasks or users.

NUMA architectures logically follow in scaling from symmetric multiprocessing (SMP) architectures. They were developed commercially during the 1990s by Unisys, Convex Computer (later Hewlett-Packard), Honeywell Information Systems Italy (HISI) (later Groupe Bull), Silicon Graphics (later Silicon Graphics International), Sequent Computer Systems (later IBM), Data General (later EMC, now Dell Technologies), and Digital (later Compaq, then HP, now HPE). Techniques developed by these companies later featured in a variety of Unix-like operating systems, and to an extent in Windows NT.

The first commercial implementation of a NUMA-based Unix system was the Symmetrical Multi Processing XPS-100 family of servers, designed by Dan Gielan of VAST Corporation for Honeywell Information Systems Italy.

Overview

Modern CPUs operate considerably faster than the main memory they use. In the early days of computing and data processing, the CPU generally ran slower than its own memory. The performance lines of processors and memory crossed in the 1960s with the advent of the first supercomputers. Since then, CPUs increasingly have found themselves "starved for data" and having to stall while waiting for data to arrive from memory (e.g. for von Neumann architecture-based computers, see Von Neumann bottleneck). Many supercomputer designs of the 1980s and 1990s focused on providing high-speed memory access as opposed to faster processors, allowing the computers to work on large data sets at speeds other systems could not approach.

Limiting the number of memory accesses provided the key to extracting high performance from a modern computer. For commodity processors, this meant installing an ever-increasing amount of high-speed cache memory and using increasingly sophisticated algorithms to avoid cache misses. But the dramatic increase in size of the operating systems and of the applications run on them has generally overwhelmed these cache-processing improvements. Multi-processor systems without NUMA make the problem considerably worse. Now a system can starve several processors at the same time, notably because only one processor can access the computer's memory at a time.

NUMA attempts to address this problem by providing separate memory for each processor, avoiding the performance hit when several processors attempt to address the same memory. For problems involving spread data (common for servers and similar applications), NUMA can improve the performance over a single shared memory by a factor of roughly the number of processors (or separate memory banks). Another approach to addressing this problem is the multi-channel memory architecture, in which a linear increase in the number of memory channels increases the memory access concurrency linearly.

Of course, not all data ends up confined to a single task, which means that more than one processor may require the same data. To handle these cases, NUMA systems include additional hardware or software to move data between memory banks. This operation slows the processors attached to those banks, so the overall speed increase due to NUMA heavily depends on the nature of the running tasks.
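The dependence on workload can be made concrete with a back-of-the-envelope model. The sketch below is only an illustration under stated assumptions: the function name `average_access_time` and the latency figures are hypothetical, and real systems add effects (caches, contention, interconnect hops) that this weighted average ignores.

```python
def average_access_time(local_fraction, local_ns, remote_ns):
    """Mean latency of one memory access when `local_fraction` of accesses
    go to the processor's own memory bank and the rest go to a remote bank."""
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns


if __name__ == "__main__":
    # Hypothetical latencies, chosen only for illustration.
    for fraction in (1.0, 0.9, 0.5):
        print(fraction, average_access_time(fraction, 100.0, 200.0))
```

In this model, a task whose data is entirely local sees the local latency, while a task that shares half of its data with another node sees a latency halfway between the two figures.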

Implementations

AMD implemented NUMA with its Opteron processor (2003), using HyperTransport. Intel announced NUMA compatibility for its x86 and Itanium servers in late 2007 with its Nehalem and Tukwila CPUs. Both Intel CPU families share a common chipset; the interconnection is called Intel QuickPath Interconnect (QPI), which provides extremely high bandwidth to enable high on-board scalability and was replaced by a new version called Intel UltraPath Interconnect with the release of Skylake (2017).

Cache coherent NUMA (ccNUMA)

Nearly all CPU architectures use a small amount of very fast non-shared memory known as cache to exploit locality of reference in memory accesses. With NUMA, maintaining cache coherence across shared memory has a significant overhead. Although simpler to design and build, non-cache-coherent NUMA systems become prohibitively complex to program in the standard von Neumann architecture programming model.

Typically, ccNUMA uses inter-processor communication between cache controllers to keep a consistent memory image when more than one cache stores the same memory location. For this reason, ccNUMA may perform poorly when multiple processors attempt to access the same memory area in rapid succession. Support for NUMA in operating systems attempts to reduce the frequency of this kind of access by allocating processors and memory in NUMA-friendly ways and by avoiding scheduling and locking algorithms that make NUMA-unfriendly accesses necessary.

Alternatively, cache coherency protocols such as the MESIF protocol attempt to reduce the communication required to maintain cache coherency.

Scalable Coherent Interface (SCI) is an IEEE standard defining a directory-based cache coherency protocol to avoid scalability limitations found in earlier multiprocessor systems. For example, SCI is used as the basis for the NumaConnect technology.
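The cost of several processors touching the same memory area can be illustrated with a toy directory that counts invalidation messages. This is a deliberately simplified write-invalidate sketch, not MESIF or SCI: the `ToyDirectory` class and the helper functions are hypothetical names, and a real protocol tracks more states and message types.

```python
from collections import defaultdict


class ToyDirectory:
    """Tracks which processors hold a copy of each cache line."""

    def __init__(self):
        self.sharers = defaultdict(set)  # line address -> set of processor ids
        self.invalidations = 0           # invalidation messages sent so far

    def read(self, cpu, line):
        # A read adds the reader to the line's sharer set.
        self.sharers[line].add(cpu)

    def write(self, cpu, line):
        # A write invalidates every other cached copy before proceeding.
        others = self.sharers[line] - {cpu}
        self.invalidations += len(others)
        self.sharers[line] = {cpu}


def ping_pong(writes):
    """Two processors alternately write the same line."""
    directory = ToyDirectory()
    for i in range(writes):
        directory.write(i % 2, line=0)
    return directory.invalidations


def partitioned(writes):
    """Two processors alternately write, each to its own line."""
    directory = ToyDirectory()
    for i in range(writes):
        directory.write(i % 2, line=i % 2)
    return directory.invalidations
```

In this toy model, every write after the first in `ping_pong` triggers an invalidation, whereas `partitioned` triggers none, which mirrors why NUMA-friendly placement tries to keep each processor working on its own data.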

NUMA vs. cluster computing

One can view NUMA as a tightly coupled form of cluster computing. The addition of virtual memory paging to a cluster architecture can allow the implementation of NUMA entirely in software. However, the inter-node latency of software-based NUMA remains several orders of magnitude greater (slower) than that of hardware-based NUMA.

Software support

Since NUMA largely influences memory access performance, certain software optimizations are needed to allow scheduling threads and processes close to their in-memory data.

* Microsoft Windows 7 and Windows Server 2008 R2 added support for NUMA architecture over 64 logical cores.
* Java 7 added support for a NUMA-aware memory allocator and garbage collector.
* Linux kernel:
** Version 2.5 provided basic NUMA support, which was further improved in subsequent kernel releases.
** Version 3.8 of the Linux kernel brought a new NUMA foundation that allowed development of more efficient NUMA policies in later kernel releases.
** Version 3.13 of the Linux kernel brought numerous policies that aim at putting a process near its memory, together with the handling of cases such as having memory pages shared between processes, or the use of transparent huge pages; new sysctl settings allow NUMA balancing to be enabled or disabled, as well as the configuration of various NUMA memory balancing parameters.
* OpenSolaris models NUMA architecture with lgroups.
* FreeBSD added support for NUMA architecture in version 9.0.
* IRIX (discontinued as of 2021) supported ccNUMA architecture over 1240 CPUs with the Origin server series.
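As an illustration of how software can discover which CPUs sit close to which memory, the sketch below reads the per-node CPU lists that Linux exposes under `/sys/devices/system/node`. It is a minimal sketch, assuming a Linux system with that sysfs tree; the helper names `parse_cpulist` and `numa_nodes` are hypothetical, and on other systems `numa_nodes` simply returns an empty dictionary.

```python
import glob
import os


def parse_cpulist(text):
    """Expand a Linux-style CPU list such as "0-3,8-11" into a sorted list."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # empty list, e.g. a memory-only node
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def numa_nodes(sysfs_root="/sys/devices/system/node"):
    """Return {node id: [cpu ids]}; empty if the sysfs tree is absent."""
    nodes = {}
    pattern = os.path.join(sysfs_root, "node[0-9]*", "cpulist")
    for path in glob.glob(pattern):
        node_dir = os.path.basename(os.path.dirname(path))  # e.g. "node0"
        with open(path) as handle:
            nodes[int(node_dir[len("node"):])] = parse_cpulist(handle.read())
    return nodes


if __name__ == "__main__":
    for node, cpus in sorted(numa_nodes().items()):
        print(f"node {node}: CPUs {cpus}")
```

On Linux, a CPU list obtained this way could be passed to `os.sched_setaffinity(0, cpus)` to keep the current process on one node's processors; whether that helps depends on where the process's memory is actually allocated.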

Hardware support

As of 2011, ccNUMA systems are multiprocessor systems based on the AMD Opteron processor, which can be implemented without external logic, and the Intel Itanium processor, which requires the chipset to support NUMA. Examples of ccNUMA-enabled chipsets are the SGI Shub (Super hub), the Intel E8870, the HP sx2000 (used in the Integrity and Superdome servers), and those found in NEC Itanium-based systems. Earlier ccNUMA systems such as those from Silicon Graphics were based on MIPS processors and the DEC Alpha 21364 (EV7) processor.
