Fermi is the codename for a graphics processing unit (GPU) microarchitecture developed by Nvidia, first released to retail in April 2010, as the successor to the Tesla microarchitecture. It was the primary microarchitecture used in the GeForce 400 series and 500 series. All desktop Fermi GPUs were manufactured in 40nm, mobile Fermi GPUs in 40nm and 28nm. Fermi is the oldest microarchitecture from Nvidia that receives support for Microsoft's rendering API Direct3D 12 feature_level 11. Fermi was followed by Kepler, and used alongside Kepler in the GeForce 600 series, GeForce 700 series, and GeForce 800 series, in the latter two only in mobile GPUs. In the workstation market, Fermi found use in the Quadro x000 series, Quadro NVS models, and in Nvidia Tesla computing modules. The architecture is named after Enrico Fermi, an Italian physicist.

Overview

Fermi graphics processing units (GPUs) feature 3.0 billion transistors; a schematic is sketched in Fig. 1.

* Streaming Multiprocessor (SM): composed of 32 CUDA cores (see Streaming Multiprocessor and CUDA core sections).
* GigaThread global scheduler: distributes thread blocks to SM thread schedulers and manages the context switches between threads during execution (see Warp Scheduling section).
* Host interface: connects the GPU to the CPU via a PCI-Express v2 bus (peak transfer rate of 8 GB/s).
* DRAM: supports up to 6 GB of GDDR5 DRAM memory thanks to the 64-bit addressing capability (see Memory Architecture section).
* Clock frequency: 1.5 GHz (not released by Nvidia, but estimated by Insight 64).
* Peak performance: 1.5 TFlops.
* Global memory clock: 2 GHz.
* DRAM bandwidth: 192 GB/s.
* H.264 FHD decode support.
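The two transfer rates quoted above can be reproduced with simple arithmetic. The following is a minimal sketch, not an official derivation: the x16 link width, the 8b/10b encoding overhead, the 384-bit memory bus, and the two transfers per memory clock are assumptions that the overview does not state, and the function names are hypothetical.

```python
# Back-of-the-envelope peak transfer rates for the figures in the overview.
# Assumptions (not stated in the text): a x16 PCIe 2.0 link with 8b/10b
# encoding, a 384-bit memory bus, and two data transfers per memory clock.


def pcie2_peak_gb_per_s(lanes: int = 16) -> float:
    """Peak one-directional PCIe 2.0 bandwidth in GB/s."""
    raw_bits_per_s = lanes * 5_000_000_000       # 5 GT/s per lane
    payload_bits_per_s = raw_bits_per_s * 8 // 10  # 8b/10b line encoding
    return payload_bits_per_s / 8 / 1e9


def dram_peak_gb_per_s(memory_clock_ghz: float = 2.0,
                       bus_width_bits: int = 384,
                       transfers_per_clock: int = 2) -> float:
    """Peak DRAM bandwidth in GB/s for the given clock and bus width."""
    bytes_per_transfer = bus_width_bits / 8
    return memory_clock_ghz * transfers_per_clock * bytes_per_transfer


print(pcie2_peak_gb_per_s())  # 8.0
print(dram_peak_gb_per_s())   # 192.0
```

Under these assumptions the results match the 8 GB/s host interface rate and the 192 GB/s DRAM bandwidth listed above.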

Streaming multiprocessor

Each SM features 32 single-precision CUDA cores, 16 load/store units, four Special Function Units (SFUs), a 64 KB block of high speed on-chip memory (see L1+Shared Memory subsection) and an interface to the L2 cache (see L2 Cache subsection).

Load/Store Units

Allow source and destination addresses to be calculated for 16 threads per clock. Load and store the data from/to cache or DRAM.

Special Function Units (SFUs)

Execute transcendental instructions such as sine, cosine, reciprocal, and square root. Each SFU executes one instruction per thread, per clock; with four SFUs per SM, a 32-thread warp executes over eight clocks. The SFU pipeline is decoupled from the dispatch unit, allowing the dispatch unit to issue to other execution units while the SFU is occupied.
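The eight-clock figure follows from dividing the warp size by the number of units that serve it. A minimal sketch of that arithmetic is shown here; the helper name `clocks_per_warp` is hypothetical, and the model ignores pipeline latency and counts only issue slots.

```python
import math

WARP_SIZE = 32  # threads per warp, as stated in the warp scheduling section


def clocks_per_warp(units: int, warp_size: int = WARP_SIZE) -> int:
    """Clocks needed to push one warp through a group of identical units,
    assuming each unit handles one thread per clock."""
    if units <= 0:
        raise ValueError("units must be positive")
    return math.ceil(warp_size / units)


print(clocks_per_warp(4))   # four SFUs: 8 clocks
print(clocks_per_warp(16))  # 16 load/store units: 2 clocks
```

The same calculation gives two clocks for a warp issued to the 16 load/store units or to one group of 16 CUDA cores.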

CUDA core

Integer Arithmetic Logic Unit (ALU)

Supports full 32-bit precision for all instructions, consistent with standard programming language requirements. It is also optimized to efficiently support 64-bit operations in workstation and server models, but this capability is artificially limited in consumer versions.

Floating Point Unit (FPU)

Implements the new IEEE 754-2008 floating-point standard, providing the fused multiply-add (FMA) instruction for both single and double precision arithmetic. Up to 16 double precision fused multiply-add operations can be performed per SM, per clock.

Fused multiply-add

Fused multiply-add (FMA) performs multiplication and addition (i.e., A*B+C) with a single final rounding step, with no loss of precision in the addition. FMA is more accurate than performing the operations separately.
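The effect of the single rounding step can be illustrated on a CPU. This is a sketch under stated assumptions: it emulates a fused multiply-add in double precision by computing A*B+C exactly with rational arithmetic and rounding once, rather than calling any GPU instruction, and the function names are hypothetical.

```python
from fractions import Fraction


def fma_emulated(a: float, b: float, c: float) -> float:
    """Emulate a*b + c with a single rounding: compute the exact rational
    result, then round to the nearest double once."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)


def multiply_then_add(a: float, b: float, c: float) -> float:
    """Unfused version: the product is rounded before the addition."""
    return a * b + c


a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

# The exact product is 1 - 2**-60, which rounds to 1.0 as a double,
# so the unfused result loses the small term entirely.
print(multiply_then_add(a, b, c))  # 0.0
print(fma_emulated(a, b, c))       # -8.673617379884035e-19, i.e. -2**-60
```

In this example the separate multiply and add return zero, while the single-rounding version keeps the exact answer.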

Warp scheduling

The Fermi architecture uses a two-level, distributed thread scheduler. Each SM can issue instructions consuming any two of the four green execution columns shown in the schematic in Fig. 1. For example, the SM can mix 16 operations from the 16 cores in the first column with 16 operations from the 16 cores in the second column, or 16 operations from the load/store units with four from the SFUs, or any other combination the program specifies. 64-bit floating-point operations require both of the first two execution columns, so they run at half the speed of 32-bit operations.

Dual Warp Scheduler

At the SM level, each warp scheduler distributes warps of 32 threads to its execution units. Each SM features two warp schedulers and two instruction dispatch units, allowing two warps to be issued and executed concurrently. The dual warp scheduler selects two warps, and issues one instruction from each warp to a group of 16 cores, 16 load/store units, or 4 SFUs. Most instructions can be dual issued; two integer instructions, two floating-point instructions, or a mix of integer, floating-point, load, store, and SFU instructions can be issued concurrently. Double precision instructions do not support dual dispatch with any other operation.
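The dual-issue rules described in this section can be captured in a toy model. The sketch below is an illustration only, not Nvidia's scheduler logic: the column names, the instruction class names, and the function `can_dual_issue` are hypothetical, and the model assumes two instructions can be paired only when they occupy different execution columns.

```python
# Toy model of one SM's four execution columns: two groups of 16 CUDA
# cores, one group of 16 load/store units, and one group of four SFUs.
# All names here are hypothetical labels chosen for illustration.
NEEDS = {
    "int":   [{"cores_a"}, {"cores_b"}],  # may run on either core group
    "fp32":  [{"cores_a"}, {"cores_b"}],
    "fp64":  [{"cores_a", "cores_b"}],    # needs both core groups
    "load":  [{"ldst"}],
    "store": [{"ldst"}],
    "sfu":   [{"sfu"}],
}


def can_dual_issue(op_a: str, op_b: str) -> bool:
    """Return True if the two instruction classes can be issued together."""
    # Stated rule: double precision never dual-dispatches with anything.
    if "fp64" in (op_a, op_b):
        return False
    # Otherwise the pair needs some choice of disjoint columns.
    return any(x.isdisjoint(y) for x in NEEDS[op_a] for y in NEEDS[op_b])


print(can_dual_issue("int", "int"))    # True
print(can_dual_issue("fp32", "load"))  # True
print(can_dual_issue("load", "sfu"))   # True
print(can_dual_issue("fp64", "load"))  # False
```

The model also rejects pairs that compete for the same single column, such as two SFU instructions; that is an inference from the column description rather than something the text states explicitly.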

Performance

The theoretical single-precision processing power of a Fermi GPU in GFLOPS is computed as 2 (operations per FMA instruction per CUDA core per cycle) × number of CUDA cores × shader clock speed (in GHz). Note that the previous generation, Tesla, could dual-issue MAD+MUL to CUDA cores and SFUs in parallel, but Fermi lost this ability, as it can only issue 32 instructions per cycle per SM, which keeps just its 32 CUDA cores fully utilized. Therefore, it is not possible to leverage the SFUs to reach more than 2 operations per CUDA core per cycle.

The theoretical double-precision processing power of a Fermi GPU is 1/2 of the single-precision performance on GF100/110. However, in practice this double-precision power is only available on professional Quadro and Tesla cards, while consumer GeForce cards are capped to 1/8 of the single-precision performance.
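The formula above is easy to check numerically. The following sketch uses hypothetical function names and takes its example inputs from the overview: 16 SMs of 32 CUDA cores each (512 cores) and the estimated 1.5 GHz clock, which is not an official Nvidia figure.

```python
def peak_sp_gflops(cuda_cores: int, shader_clock_ghz: float) -> float:
    """Theoretical single-precision GFLOPS: 2 operations per FMA,
    per CUDA core, per cycle."""
    return 2 * cuda_cores * shader_clock_ghz


def peak_dp_gflops(cuda_cores: int, shader_clock_ghz: float,
                   consumer: bool = False) -> float:
    """Theoretical double-precision GFLOPS on GF100/110: 1/2 of single
    precision on professional cards, 1/8 on consumer GeForce cards."""
    ratio = 1 / 8 if consumer else 1 / 2
    return peak_sp_gflops(cuda_cores, shader_clock_ghz) * ratio


cores = 16 * 32  # 16 SMs with 32 CUDA cores each
print(peak_sp_gflops(cores, 1.5))                 # 1536.0
print(peak_dp_gflops(cores, 1.5))                 # 768.0
print(peak_dp_gflops(cores, 1.5, consumer=True))  # 192.0
```

With these inputs the single-precision result of 1536 GFLOPS agrees with the peak performance of about 1.5 TFlops listed in the overview.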

Memory

Fermi provides an L1 cache per SM and a unified L2 cache that services all operations (load, store and texture).

Registers

Each SM has 32K 32-bit registers. Each thread has access to its own registers and not those of other threads. The maximum number of registers that can be used by a thread in a CUDA kernel is 63. The number of available registers per thread degrades gracefully from 63 to 21 as the workload (and hence resource requirements) increases with the number of threads. Registers have a very high bandwidth: about 8,000 GB/s.
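The range from 63 down to 21 registers can be reproduced by dividing the register file among the resident threads. This is a simplified sketch: the limit of 1,536 resident threads per SM is an assumption taken from the compute capability 2.x limits rather than from this text, the helper name is hypothetical, and real allocation granularity is ignored.

```python
REGISTERS_PER_SM = 32 * 1024     # 32K 32-bit registers per SM
MAX_REGISTERS_PER_THREAD = 63    # per-thread limit stated in the text
MAX_RESIDENT_THREADS = 1536      # assumption: compute capability 2.x limit


def registers_per_thread(resident_threads: int) -> int:
    """Registers available to each thread when the register file is split
    evenly among the resident threads (allocation granularity ignored)."""
    if not 0 < resident_threads <= MAX_RESIDENT_THREADS:
        raise ValueError("resident_threads out of range")
    share = REGISTERS_PER_SM // resident_threads
    return min(MAX_REGISTERS_PER_THREAD, share)


print(registers_per_thread(512))   # 63 (capped by the per-thread limit)
print(registers_per_thread(1024))  # 32
print(registers_per_thread(1536))  # 21
```

At full occupancy each thread gets 21 registers, while kernels with 512 or fewer resident threads can use the full 63.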

L1+Shared Memory

On-chip memory that can be used to cache data for individual threads (register spilling/L1 cache), to share data among several threads (shared memory), or both. This 64 KB memory can be configured as either 48 KB of shared memory with 16 KB of L1 cache, or 16 KB of shared memory with 48 KB of L1 cache. Shared memory is accessible by the threads in the same thread block; it enables them to cooperate, facilitates extensive reuse of on-chip data, and greatly reduces off-chip traffic. It provides low-latency access (10-20 cycles) and very high bandwidth (1,600 GB/s) to moderate amounts of data (such as intermediate results in a series of calculations, one row or column of data for matrix operations, a line of video, etc.). David Patterson says that this shared memory uses the idea of a local scratchpad.

Local Memory

Local memory is a memory location used to hold "spilled" registers. Register spilling occurs when a thread block requires more register storage than is available on an SM. Local memory is used only for some automatic variables (which are declared in the device code without any of the __device__, __shared__, or __constant__ qualifiers). Generally, an automatic variable resides in a register except for the following: (1) arrays that the compiler cannot determine are indexed with constant quantities; (2) large structures or arrays that would consume too much register space; and (3) any variable the compiler decides to spill to local memory when a kernel uses more registers than are available on the SM.

L2 Cache

A 768 KB unified L2 cache, shared among the 16 SMs, services all loads and stores from/to global memory, including copies to/from the CPU host, as well as texture requests. The L2 cache subsystem also implements atomic operations, used for managing access to data that must be shared across thread blocks or even kernels.

Global memory

Global memory (VRAM) is accessible by all threads directly as well as the host system over the PCIe bus. It has a high latency of 400-800 cycles.

Video decompression/compression

See Nvidia NVDEC (formerly called NVCUVID) as well as Nvidia PureVideo. The Nvidia NVENC technology was not yet available; it was introduced in the successor, Kepler.

Fermi chips

* GF100
* GF104
* GF106
* GF108
* GF110
* GF114
* GF116
* GF117
* GF119

