CUDA

CUDA (Compute Unified Device Architecture) is a parallel computing platform and application programming interface (API) that allows software to use certain types of graphics processing units (GPUs) for general-purpose processing, an approach called general-purpose computing on GPUs (GPGPU). CUDA is a software layer that gives direct access to the GPU's virtual instruction set and parallel computational elements for the execution of compute kernels. CUDA is designed to work with programming languages such as C, C++, and Fortran. This accessibility makes it easier for specialists in parallel programming to use GPU resources, in contrast to prior APIs like Direct3D and OpenGL, which required advanced skills in graphics programming. CUDA-powered GPUs also support programming frameworks such as OpenMP, OpenACC and OpenCL, as well as HIP, by compiling such code to CUDA. CUDA was created by Nvidia. When it was first introduced, the name was an acronym for Compute Unified Device Architecture, but Nvidia later dropped the common use of the acronym.
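The compute kernels mentioned above are launched over a grid of thread blocks, and each thread derives the data element it works on from its block and thread coordinates. As a minimal pure-Python sketch of that indexing scheme (no GPU required; the function and variable names here are illustrative, not part of any CUDA API), an elementwise vector add can be emulated like this:

```python
# Pure-Python sketch of CUDA's thread-indexing scheme: a grid of blocks,
# each containing block_dim threads; every thread handles the element at
# global index blockIdx * blockDim + threadIdx.

def vec_add_kernel(block_idx, thread_idx, block_dim, a, b, out):
    i = block_idx * block_dim + thread_idx  # global thread index
    if i < len(a):                          # guard against overrunning the data
        out[i] = a[i] + b[i]

def launch(kernel, grid_dim, block_dim, *args):
    # Serial stand-in for a CUDA kernel launch such as
    # vec_add<<<grid_dim, block_dim>>>(...); real threads run in parallel.
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, thread_idx, block_dim, *args)

a = list(range(10))
b = [x * 10 for x in a]
out = [0] * 10
launch(vec_add_kernel, 3, 4, a, b, out)  # 3 blocks of 4 threads cover 10 elements
print(out)  # [0, 11, 22, 33, 44, 55, 66, 77, 88, 99]
```

The guard `i < len(a)` mirrors the bounds check real CUDA kernels need when the grid size is rounded up past the data size.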


Background

The graphics processing unit (GPU), as a specialized computer processor, addresses the demands of real-time, high-resolution 3D graphics and other compute-intensive tasks. By 2012, GPUs had evolved into highly parallel multi-core systems allowing efficient manipulation of large blocks of data. This design is more effective than general-purpose central processing units (CPUs) for algorithms in situations where large blocks of data are processed in parallel, such as:

* cryptographic hash functions
* machine learning
* molecular dynamics simulations
* physics engines
* sorting algorithms


Programming abilities

The CUDA platform is accessible to software developers through CUDA-accelerated libraries, compiler directives such as OpenACC, and extensions to industry-standard programming languages including C, C++ and Fortran. C/C++ programmers can use 'CUDA C/C++', compiled to PTX with nvcc, Nvidia's LLVM-based C/C++ compiler, or with clang itself. Fortran programmers can use 'CUDA Fortran', compiled with the PGI CUDA Fortran compiler from The Portland Group. In addition to libraries, compiler directives, CUDA C/C++ and CUDA Fortran, the CUDA platform supports other computational interfaces, including the Khronos Group's OpenCL, Microsoft's DirectCompute, OpenGL Compute Shader and C++ AMP. Third-party wrappers are also available for Python, Perl, Fortran, Java, Ruby, Lua, Common Lisp, Haskell, R, MATLAB, IDL, and Julia, and there is native support in Mathematica. In the computer game industry, GPUs are used for graphics rendering and for game physics calculations (physical effects such as debris, smoke, fire, fluids); examples include PhysX and Bullet. CUDA has also been used to accelerate non-graphical applications in computational biology, cryptography and other fields by an order of magnitude
or more. CUDA provides both a low level API (CUDA Driver API, non single-source) and a higher level API (CUDA Runtime API, single-source). The initial CUDA SDK was made public on 15 February 2007, for
Microsoft Windows and Linux
. Mac OS X support was later added in version 2.0, superseding the beta released February 14, 2008. CUDA works with all Nvidia GPUs from the G8x series onwards, including GeForce, Quadro and the Tesla line. CUDA is compatible with most standard operating systems.

CUDA 8.0 comes with the following libraries (for compilation and runtime, in alphabetical order):

* cuBLAS – CUDA Basic Linear Algebra Subroutines library
* CUDART – CUDA Runtime library
* cuFFT – CUDA Fast Fourier Transform library
* cuRAND – CUDA Random Number Generation library
* cuSOLVER – CUDA-based collection of dense and sparse direct solvers
* cuSPARSE – CUDA Sparse Matrix library
* NPP – NVIDIA Performance Primitives library
* nvGRAPH – NVIDIA Graph Analytics library
* NVML – NVIDIA Management Library
* NVRTC – NVIDIA Runtime Compilation library for CUDA C++

CUDA 8.0 comes with these other software components:

* nView – NVIDIA nView Desktop Management Software
* NVWMI – NVIDIA Enterprise Management Toolkit
* GameWorks PhysX – a multi-platform game physics engine

CUDA 9.0–9.2 comes with these other components:

* CUTLASS 1.0 – custom linear algebra algorithms
* NVCUVID – NVIDIA Video Decoder; deprecated in CUDA 9.2 and now available in the NVIDIA Video Codec SDK

CUDA 10 comes with these other components:

* nvJPEG – hybrid (CPU and GPU) JPEG processing

CUDA 11.0–11.8 comes with these other components:

* CUB – one of the newly supported C++ libraries
* MIG – multi-instance GPU support
* nvJPEG2000 – JPEG 2000 encoder and decoder
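Libraries such as cuBLAS and cuFFT accelerate the same operations as their well-known CPU counterparts (BLAS routines such as matrix multiply, and fast Fourier transforms). As a CPU-side sketch of the computations these libraries offload, using NumPy purely as a stand-in for what cuBLAS and cuFFT would run on the GPU:

```python
import numpy as np

# CPU stand-ins for two operations CUDA libraries offload to the GPU:
# a general matrix multiply (cuBLAS) and an FFT (cuFFT).
A = np.arange(6, dtype=np.float32).reshape(2, 3)
B = np.arange(12, dtype=np.float32).reshape(3, 4)

C = A @ B                          # what a cuBLAS GEMM computes on the GPU
spectrum = np.fft.fft(np.ones(8))  # what cuFFT computes on the GPU

print(C.shape)      # (2, 4)
print(spectrum[0])  # (8+0j): the DC term of an all-ones signal
```

The GPU versions expose C APIs (for example, cuBLAS's `cublasSgemm`) rather than this array syntax; the point here is only what is being computed, not how the CUDA calls look.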


Advantages

CUDA has several advantages over traditional general-purpose computation on GPUs (GPGPU) using graphics APIs:

* Scattered reads – code can read from arbitrary addresses in memory.
* Unified virtual memory (CUDA 4.0 and above)
* Unified memory (CUDA 6.0 and above)
* Shared memory – CUDA exposes a fast shared memory region that can be shared among threads. This can be used as a user-managed cache, enabling higher bandwidth than is possible using texture lookups.
* Faster downloads and readbacks to and from the GPU
* Full support for integer and bitwise operations, including integer texture lookups
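The shared-memory point above can be illustrated with a pure-Python sketch of the usual pattern: each block stages a tile of global data into a small per-block buffer, and the block's threads then work from that cached tile instead of re-reading global memory (serial emulation only; names are illustrative):

```python
# Sketch of CUDA shared memory as a user-managed cache: each block stages
# one tile of the input into a per-block buffer ("shared"), then the
# block's threads cooperate on that tile. Emulated serially on the CPU.

def block_sum(data, block_dim):
    total = 0
    for block_start in range((0), len(data), block_dim):
        # One block stages its tile into shared memory (one global read per element)
        shared = data[block_start:block_start + block_dim]
        # In real CUDA, __syncthreads() would go here before threads read `shared`
        total += sum(shared)  # threads reduce from the cached tile
    return total

print(block_sum(list(range(100)), block_dim=32))  # 4950, same as sum(range(100))
```

In a real kernel the payoff is that subsequent accesses hit the fast on-chip shared memory rather than device memory; the serial emulation only shows the staging structure.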


Limitations

* Whether for the host computer or the GPU device, all CUDA source code is now processed according to C++ syntax rules. This was not always the case: earlier versions of CUDA were based on C syntax rules. As with the more general case of compiling C code with a C++ compiler, it is therefore possible that old C-style CUDA source code will either fail to compile or will not behave as originally intended.
* Interoperability with rendering languages such as OpenGL is one-way, with OpenGL having access to registered CUDA memory but CUDA not having access to OpenGL memory.
* Copying between host and device memory may incur a performance hit due to system bus bandwidth and latency (this can be partly alleviated with asynchronous memory transfers, handled by the GPU's DMA engine).
* Threads should be running in groups of at least 32 for best performance, with the total number of threads numbering in the thousands. Branches in the program code do not affect performance significantly, provided that each of the 32 threads takes the same execution path; the SIMD execution model becomes a significant limitation for any inherently divergent task (e.g. traversing a space-partitioning data structure during ray tracing).
* No emulator or fallback functionality is available for modern revisions.
* Valid C++ may sometimes be flagged and prevent compilation due to the way the compiler approaches optimization for target GPU device limitations.
* C++ run-time type information (RTTI) and C++-style exception handling are only supported in host code, not in device code.
* In single precision on first-generation CUDA compute capability 1.x devices, denormal numbers are unsupported and are instead flushed to zero, and the precision of both the division and square root operations is slightly lower than IEEE 754-compliant single-precision math. Devices that support compute capability 2.0 and above support denormal numbers, and the division and square root operations are IEEE 754 compliant by default. However, users can obtain the faster gaming-grade math of compute capability 1.x devices if desired by setting compiler flags to disable accurate divisions and accurate square roots, and to enable flushing denormal numbers to zero.
* Unlike OpenCL, CUDA-enabled GPUs are only available from Nvidia. Attempts to implement CUDA on other GPUs include:
** Project Coriander: converts CUDA C++11 source to OpenCL 1.2 C. A fork of CUDA-on-CL intended to run TensorFlow.
** CU2CL: converts CUDA 3.2 C++ to OpenCL C.
** GPUOpen HIP: a thin abstraction layer on top of CUDA and ROCm intended for AMD and Nvidia GPUs. It has a conversion tool for importing CUDA C++ source and supports CUDA 4.0 plus C++11 and float16.
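The warp-divergence limitation above can be made concrete with a toy cost model: all 32 threads of a warp execute in lockstep, so when threads within one warp take different branch paths, the hardware runs each distinct path as a separate serial pass. A small Python sketch of that model (illustrative only, not a performance tool):

```python
# Toy cost model for SIMD warp divergence: a warp of 32 lockstep threads
# must execute every distinct branch path taken by any of its threads,
# so the number of serialized passes grows with intra-warp divergence.

WARP_SIZE = 32

def warp_passes(paths_per_thread):
    """Serialized passes needed = number of distinct paths within the warp."""
    assert len(paths_per_thread) == WARP_SIZE
    return len(set(paths_per_thread))

uniform = ["A"] * WARP_SIZE                               # all threads agree
divergent = ["A" if i % 2 == 0 else "B" for i in range(WARP_SIZE)]

print(warp_passes(uniform))    # 1: a coherent warp pays for one pass
print(warp_passes(divergent))  # 2: both sides of the branch run serially
```

This is why branching is cheap when all 32 threads of a warp agree, while inherently divergent workloads such as tree traversal during ray tracing lose much of the SIMD throughput.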


GPUs supported

Supported CUDA compute capability by GPU generation and SDK version:

* CUDA SDK 1.0: compute capability 1.0 – 1.1 (Tesla)
* CUDA SDK 1.1: compute capability 1.0 – 1.1+x (Tesla)
* CUDA SDK 2.0: compute capability 1.0 – 1.1+x (Tesla)
* CUDA SDK 2.1 – 2.3.1: compute capability 1.0 – 1.3 (Tesla)
* CUDA SDK 3.0 – 3.1: compute capability 1.0 – 2.0 (Tesla, Fermi)
* CUDA SDK 3.2: compute capability 1.0 – 2.1 (Tesla, Fermi)
* CUDA SDK 4.0 – 4.2: compute capability 1.0 – 2.1+x (Tesla, Fermi)
* CUDA SDK 5.0 – 5.5: compute capability 1.0 – 3.5 (Tesla, Fermi, Kepler)
* CUDA SDK 6.0: compute capability 1.0 – 3.5 (Tesla, Fermi, Kepler)
* CUDA SDK 6.5: compute capability 1.1 – 5.x (Tesla, Fermi, Kepler, Maxwell). Last version with support for compute capability 1.x (Tesla).
* CUDA SDK 7.0 – 7.5: compute capability 2.0 – 5.x (Fermi, Kepler, Maxwell)
* CUDA SDK 8.0: compute capability 2.0 – 6.x (Fermi, Kepler, Maxwell, Pascal). Last version with support for compute capability 2.x (Fermi).
* CUDA SDK 9.0 – 9.2: compute capability 3.0 – 7.0 (Kepler, Maxwell, Pascal, Volta)
* CUDA SDK 10.0 – 10.2: compute capability 3.0 – 7.5 (Kepler, Maxwell, Pascal, Volta, Turing). Last version with support for compute capability 3.0 and 3.2 (Kepler in part). 10.2 is the last official release for macOS; support will not be available for macOS in newer releases.
* CUDA SDK 11.0: compute capability 3.5 – 8.0 (Kepler (in part), Maxwell, Pascal, Volta, Turing, Ampere (in part))
* CUDA SDK 11.1 – 11.4: compute capability 3.5 – 8.6 (Kepler (in part), Maxwell, Pascal, Volta, Turing, Ampere (in part))
* CUDA SDK 11.5 – 11.7.1: compute capability 3.5 – 8.7 (Kepler (in part), Maxwell, Pascal, Volta, Turing, Ampere)
* CUDA SDK 11.8: compute capability 3.5 – 9.0 (Kepler (in part), Maxwell, Pascal, Volta, Turing, Ampere, Ada Lovelace, Hopper)
* CUDA SDK 12.0: compute capability 5.0 – 9.0 (Maxwell, Pascal, Volta, Turing, Ampere, Ada Lovelace, Hopper)


Version features and specifications


Data types

Note: Missing lines or empty entries reflect a lack of information on that item.


Tensor cores

Note: Missing lines or empty entries reflect a lack of information on that item.


Technical Specification


Multiprocessor Architecture

For more information, read the Nvidia CUDA programming guide.


Example

This example code in C++ loads a texture from an image into an array on the GPU (the kernel body was lost in extraction and is reconstructed here as a minimal copy-through; `image`, `width`, `height` and the device buffer `d_data` are assumed to be declared and allocated elsewhere):

__global__ void kernel(float* odata, int height, int width);

texture<float, 2, cudaReadModeElementType> tex;

void foo()
{
    cudaArray* cu_array;

    // Allocate the array on the device
    cudaChannelFormatDesc description = cudaCreateChannelDesc<float>();
    cudaMallocArray(&cu_array, &description, width, height);

    // Copy the image data to the device array
    cudaMemcpyToArray(cu_array, 0, 0, image, width * height * sizeof(float),
                      cudaMemcpyHostToDevice);

    // Bind the array to the texture reference
    cudaBindTextureToArray(tex, cu_array);

    // Run the kernel
    dim3 blockDim(16, 16, 1);
    dim3 gridDim((width + blockDim.x - 1) / blockDim.x,
                 (height + blockDim.y - 1) / blockDim.y, 1);
    kernel<<<gridDim, blockDim>>>(d_data, height, width);

    // Unbind the array from the texture
    cudaUnbindTexture(tex);
} // end foo()

__global__ void kernel(float* odata, int height, int width)
{
    unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        float c = tex2D(tex, x, y);
        odata[y * width + x] = c;
    }
}

Below is an example given in Python that computes the product of two arrays on the GPU. The unofficial Python language bindings can be obtained from ''PyCUDA''.

import pycuda.compiler as comp
import pycuda.driver as drv
import numpy
import pycuda.autoinit

mod = comp.SourceModule(
    """
__global__ void multiply_them(float *dest, float *a, float *b)
{
    const int i = threadIdx.x;
    dest[i] = a[i] * b[i];
}
"""
)

multiply_them = mod.get_function("multiply_them")

a = numpy.random.randn(400).astype(numpy.float32)
b = numpy.random.randn(400).astype(numpy.float32)
dest = numpy.zeros_like(a)
multiply_them(drv.Out(dest), drv.In(a), drv.In(b), block=(400, 1, 1))

print(dest - a * b)

Additional Python bindings to simplify matrix multiplication operations can be found in the program ''pycublas''.

import numpy
from pycublas import CUBLASMatrix

A = CUBLASMatrix(numpy.mat([[1, 2, 3], [4, 5, 6]], numpy.float32))
B = CUBLASMatrix(numpy.mat([[2, 3], [4, 5], [6, 7]], numpy.float32))
C = A * B
print(C.np_mat())

while CuPy directly replaces NumPy:

import cupy

a = cupy.random.randn(400)
b = cupy.random.randn(400)
dest = a * b
print(dest - a * b)


Current and future uses of CUDA architecture

* Accelerated rendering of 3D graphics
* Accelerated interconversion of video file formats
* Accelerated encryption, decryption and compression
* Bioinformatics, e.g. NGS DNA sequencing with BarraCUDA
* Distributed calculations, such as predicting the native conformation of proteins
* Medical analysis simulations, for example virtual reality based on CT and MRI scan images
* Physical simulations, in particular in fluid dynamics
* Neural network training in machine learning problems
* Face recognition
* Volunteer computing projects, such as SETI@home and other projects using BOINC software
* Molecular dynamics
* Mining cryptocurrencies
* Structure from motion (SfM) software


See also

* SYCL – an open standard from Khronos Group for programming a variety of platforms, including GPUs, with ''single-source'' modern C++, similar to the higher-level CUDA Runtime API (''single-source'')
* BrookGPU – the Stanford University graphics group's compiler
* Array programming
* Parallel computing
* Stream processing
* rCUDA – an API for computing on remote computers
* Molecular modeling on GPU
* Vulkan – low-level, high-performance 3D graphics and computing API
* OptiX – ray tracing API by NVIDIA
* CUDA binary (cubin) – a type of fat binary

