
Hardware acceleration is the use of
computer hardware
Computer hardware includes the physical parts of a computer, such as the central processing unit (CPU), random-access memory (RAM), motherboard, computer data storage, graphics card, sound card, and computer case. It includes external devices ...
designed to perform specific functions more efficiently when compared to
software
Software consists of computer programs that instruct the Execution (computing), execution of a computer. Software also includes design documents and specifications.
The history of software is closely tied to the development of digital comput ...
running on a general-purpose
central processing unit
A central processing unit (CPU), also called a central processor, main processor, or just processor, is the primary Processor (computing), processor in a given computer. Its electronic circuitry executes Instruction (computing), instructions ...
(CPU). Any
transformation of
data
Data ( , ) are a collection of discrete or continuous values that convey information, describing the quantity, quality, fact, statistics, other basic units of meaning, or simply sequences of symbols that may be further interpreted for ...
that can be calculated in software running on a generic CPU can also be calculated in custom-made hardware, or in some mix of both.
To perform computing tasks more efficiently, generally one can invest time and money in improving the software, improving the hardware, or both. There are various approaches with advantages and disadvantages in terms of decreased
latency, increased
throughput
Network throughput (or just throughput, when in context) refers to the rate of message delivery over a communication channel in a communication network, such as Ethernet or packet radio. The data that these messages contain may be delivered ov ...
, and reduced
energy consumption
Energy consumption is the amount of energy used.
Biology
In the body, energy consumption is part of energy homeostasis. It derived from food energy. Energy consumption in the body is a product of the basal metabolic rate and the physical acti ...
. Typical advantages of focusing on software may include greater versatility, more rapid
development
Development or developing may refer to:
Arts
*Development (music), the process by which thematic material is reshaped
* Photographic development
*Filmmaking, development phase, including finance and budgeting
* Development hell, when a proje ...
, lower
non-recurring engineering
Non-recurring engineering (NRE) cost refers to the one-time cost to research, design, develop and test a new product or product enhancement. When budgeting for a new product, NRE must be considered to analyze if a new product will be profitable. ...
costs, heightened
portability, and ease of
updating features or
patching
bugs, at the cost of
overhead to compute general operations. Advantages of focusing on hardware may include
speedup
In computer architecture, speedup is a number that measures the relative performance of two systems processing the same problem. More technically, it is the improvement in speed of execution of a task executed on two similar architectures with ...
, reduced
power consumption
Electric energy consumption is energy consumption in the form of electrical energy. About a fifth of global energy is consumed as electricity: for residential, industrial, commercial, transportation and other purposes.
The global electricity con ...
, lower latency, increased
parallelism and
bandwidth, and
better utilization of area and
functional components available on an
integrated circuit
An integrated circuit (IC), also known as a microchip or simply chip, is a set of electronic circuits, consisting of various electronic components (such as transistors, resistors, and capacitors) and their interconnections. These components a ...
; at the cost of lower ability to update designs once
etched onto silicon and higher costs of
functional verification, times to market, and the need for more parts. In the hierarchy of digital computing systems ranging from general-purpose processors to
fully customized hardware, there is a tradeoff between flexibility and efficiency, with efficiency increasing by
orders of magnitude
In a ratio scale based on powers of ten, the order of magnitude is a measure of the nearness of two figures. Two numbers are "within an order of magnitude" of each other if their ratio is between 1/10 and 10. In other words, the two numbers are wi ...
when any given application is implemented higher up that hierarchy. This hierarchy includes general-purpose processors such as CPUs, more specialized processors such as programmable
shader
In computer graphics, a shader is a computer program that calculates the appropriate levels of light, darkness, and color during the rendering of a 3D scene—a process known as '' shading''. Shaders have evolved to perform a variety of s ...
s in a
GPU, applications implemented on
field-programmable gate arrays (FPGAs), and fixed-function implemented on
application-specific integrated circuit
An application-specific integrated circuit (ASIC ) is an integrated circuit (IC) chip customized for a particular use, rather than intended for general-purpose use, such as a chip designed to run in a digital voice recorder or a high-efficienc ...
s (ASICs).
Hardware acceleration is advantageous for
performance
A performance is an act or process of staging or presenting a play, concert, or other form of entertainment. It is also defined as the action or process of carrying out or accomplishing an action, task, or function.
Performance has evolved glo ...
, and practical when the functions are fixed, so updates are not as needed as in software solutions. With the advent of
reprogrammable logic devices such as FPGAs, the restriction of hardware acceleration to fully fixed algorithms has eased since 2010, allowing hardware acceleration to be applied to problem domains requiring modification to algorithms and processing
control flow
In computer science, control flow (or flow of control) is the order in which individual statements, instructions or function calls of an imperative program are executed or evaluated. The emphasis on explicit control flow distinguishes an '' ...
.
The disadvantage, however, is that in many open source projects, it requires proprietary libraries that not all vendors are keen to distribute or expose, making it difficult to integrate in such projects.
Overview
Integrated circuits
An integrated circuit (IC), also known as a microchip or simply chip, is a set of electronic circuits, consisting of various electronic components (such as transistors, resistors, and capacitors) and their interconnections. These components a ...
are designed to handle various operations on both analog and digital signals. In computing, digital signals are the most common and are typically represented as binary numbers.
Computer hardware
Computer hardware includes the physical parts of a computer, such as the central processing unit (CPU), random-access memory (RAM), motherboard, computer data storage, graphics card, sound card, and computer case. It includes external devices ...
and software use this
binary representation to perform computations. This is done by processing
Boolean functions
In mathematics, a Boolean function is a function whose arguments and result assume values from a two-element set (usually , or ). Alternative names are switching function, used especially in older computer science literature, and truth functi ...
on the binary input, and then outputting the results for storage or further processing by other devices.
Computational equivalence of hardware and software
Because all
Turing machines
A Turing machine is a mathematical model of computation describing an abstract machine that manipulates symbols on a strip of tape according to a table of rules. Despite the model's simplicity, it is capable of implementing any computer alg ...
can run any
computable function
Computable functions are the basic objects of study in computability theory. Informally, a function is ''computable'' if there is an algorithm that computes the value of the function for every value of its argument. Because of the lack of a precis ...
, it is always possible to design custom hardware that performs the same function as a given piece of software. Conversely, software can always be used to emulate the function of a given piece of hardware. Custom hardware may offer higher performance per watt for the same functions that can be specified in software.
Hardware description language
In computer engineering, a hardware description language (HDL) is a specialized computer language used to describe the structure and behavior of electronic circuits, usually to design application-specific integrated circuits (ASICs) and to progra ...
s (HDLs) such as
Verilog
Verilog, standardized as IEEE 1364, is a hardware description language (HDL) used to model electronic systems. It is most commonly used in the design and verification of digital circuits, with the highest level of abstraction being at the re ...
and
VHDL
VHDL (Very High Speed Integrated Circuit Program, VHSIC Hardware Description Language) is a hardware description language that can model the behavior and structure of Digital electronics, digital systems at multiple levels of abstraction, ran ...
can model the same
semantics
Semantics is the study of linguistic Meaning (philosophy), meaning. It examines what meaning is, how words get their meaning, and how the meaning of a complex expression depends on its parts. Part of this process involves the distinction betwee ...
as software and
synthesize the design into a
netlist
In electronic design, a netlist is a description of the connectivity of an electronic circuit. In its simplest form, a netlist consists of a list of the electronic components in a circuit and a list of the nodes they are connected to. A netwo ...
that can be programmed to an FPGA or composed into the
logic gate
A logic gate is a device that performs a Boolean function, a logical operation performed on one or more binary inputs that produces a single binary output. Depending on the context, the term may refer to an ideal logic gate, one that has, for ...
s of an ASIC.
Stored-program computers
The vast majority of software-based computing occurs on machines implementing the
von Neumann architecture
The von Neumann architecture—also known as the von Neumann model or Princeton architecture—is a computer architecture based on the '' First Draft of a Report on the EDVAC'', written by John von Neumann in 1945, describing designs discus ...
, collectively known as
stored-program computer
A stored-program computer is a computer that stores program instructions in electronically, electromagnetically, or optically accessible memory. This contrasts with systems that stored the program instructions with plugboards or similar mechani ...
s.
Computer program
A computer program is a sequence or set of instructions in a programming language for a computer to Execution (computing), execute. It is one component of software, which also includes software documentation, documentation and other intangibl ...
s are stored as data and
executed by
processors. Such processors must fetch and decode instructions, as well as
load data operands from
memory
Memory is the faculty of the mind by which data or information is encoded, stored, and retrieved when needed. It is the retention of information over time for the purpose of influencing future action. If past events could not be remembe ...
(as part of the
instruction cycle The instruction cycle (also known as the fetch–decode–execute cycle, or simply the fetch–execute cycle) is the cycle that the central processing unit (CPU) follows from boot-up until the computer has shut down in order to process instructions ...
), to execute the instructions constituting the software program. Relying on a common
cache for code and data leads to the "von Neumann bottleneck", a fundamental limitation on the throughput of software on processors implementing the von Neumann architecture. Even in the
modified Harvard architecture, where instructions and data have separate caches in the
memory hierarchy
In computer architecture, the memory hierarchy separates computer storage into a hierarchy based on response time. Since response time, complexity, and capacity are related, the levels may also be distinguished by their performance and contr ...
, there is overhead to decoding instruction
opcode
In computing, an opcode (abbreviated from operation code) is an enumerated value that specifies the operation to be performed. Opcodes are employed in hardware devices such as arithmetic logic units (ALUs), central processing units (CPUs), and ...
s and
multiplexing
In telecommunications and computer networking, multiplexing (sometimes contracted to muxing) is a method by which multiple analog or digital signals are combined into one signal over a shared medium. The aim is to share a scarce resource� ...
available
execution units on a
microprocessor
A microprocessor is a computer processor (computing), processor for which the data processing logic and control is included on a single integrated circuit (IC), or a small number of ICs. The microprocessor contains the arithmetic, logic, a ...
or
microcontroller
A microcontroller (MC, uC, or μC) or microcontroller unit (MCU) is a small computer on a single integrated circuit. A microcontroller contains one or more CPUs (processor cores) along with memory and programmable input/output peripherals. Pro ...
, leading to low circuit utilization. Modern processors that provide
simultaneous multithreading
Simultaneous multithreading (SMT) is a technique for improving the overall efficiency of superscalar CPUs with hardware multithreading. SMT permits multiple independent threads of execution to better use the resources provided by modern proces ...
exploit under-utilization of available processor functional units and
instruction level parallelism between different hardware threads.
Hardware execution units
Hardware execution units do not in general rely on the von Neumann or modified Harvard architectures and do not need to perform the instruction fetch and decode steps of an
instruction cycle The instruction cycle (also known as the fetch–decode–execute cycle, or simply the fetch–execute cycle) is the cycle that the central processing unit (CPU) follows from boot-up until the computer has shut down in order to process instructions ...
and incur those stages' overhead. If needed calculations are specified in a
register transfer level (RTL) hardware design, the time and circuit area costs that would be incurred by instruction fetch and decoding stages can be reclaimed and put to other uses.
This reclamation saves time, power, and circuit area in computation. The reclaimed resources can be used for increased parallel computation, other functions, communication, or memory, as well as increased
input/output
In computing, input/output (I/O, i/o, or informally io or IO) is the communication between an information processing system, such as a computer, and the outside world, such as another computer system, peripherals, or a human operator. Inputs a ...
capabilities. This comes at the cost of general-purpose utility.
Emerging hardware architectures
Greater RTL customization of hardware designs allows emerging architectures such as
in-memory computing,
transport triggered architectures (TTA) and
networks-on-chip (NoC) to further benefit from increased
locality of data to execution context, thereby reducing computing and communication latency between modules and functional units.
Custom hardware is limited in parallel processing capability only by the area and
logic block
In computing, a logic block or configurable logic block (CLB) is a fundamental building block of field-programmable gate array (FPGA) technology. Logic blocks can be configured by the engineer to provide reconfigurable computing, reconfigurable lo ...
s available on the
integrated circuit die. Therefore, hardware is much more free to offer
massive parallelism than software on general-purpose processors, offering a possibility of implementing the
parallel random-access machine
In computer science, a parallel random-access machine (parallel RAM or PRAM) is a shared-memory abstract machine. As its name indicates, the PRAM is intended as the parallel-computing analogy to the random-access machine (RAM) (not to be confuse ...
(PRAM) model.
It is common to build
multicore and
manycore processing units out of
microprocessor IP core schematics on a single FPGA or ASIC. Similarly, specialized functional units can be composed in parallel, as
in digital signal processing, without being embedded in a processor
IP core. Therefore, hardware acceleration is often employed for repetitive, fixed tasks involving little
conditional branching, especially on large amounts of data. This is how
Nvidia
Nvidia Corporation ( ) is an American multinational corporation and technology company headquartered in Santa Clara, California, and incorporated in Delaware. Founded in 1993 by Jensen Huang (president and CEO), Chris Malachowsky, and Curti ...
's
CUDA
In computing, CUDA (Compute Unified Device Architecture) is a proprietary parallel computing platform and application programming interface (API) that allows software to use certain types of graphics processing units (GPUs) for accelerated gene ...
line of GPUs are implemented.
Implementation metrics
As device mobility has increased, new metrics have been developed that measure the relative performance of specific acceleration protocols, considering characteristics such as physical hardware dimensions, power consumption, and operations throughput. These can be summarized into three categories: task efficiency, implementation efficiency, and flexibility. Appropriate metrics consider the area of the hardware along with both the corresponding operations throughput and energy consumed.
Applications
Examples of hardware acceleration include
bit blit acceleration functionality in graphics processing units (GPUs), use of
memristor
A memristor (; a portmanteau of ''memory resistor'') is a non-linear two-terminal electrical component relating electric charge and magnetic flux linkage. It was described and named in 1971 by Leon Chua, completing a theoretical quartet of ...
s for accelerating
neural networks
A neural network is a group of interconnected units called neurons that send signals to one another. Neurons can be either Cell (biology), biological cells or signal pathways. While individual neurons are simple, many of them together in a netwo ...
, and
regular expression
A regular expression (shortened as regex or regexp), sometimes referred to as rational expression, is a sequence of characters that specifies a match pattern in text. Usually such patterns are used by string-searching algorithms for "find" ...
hardware acceleration for
spam control in the
server industry, intended to prevent
regular expression denial of service
A regular expression denial of service (ReDoS)
is an algorithmic complexity attack that produces a denial-of-service by providing a regular expression and/or an input that takes a long time to evaluate. The attack exploits the fact that many Regu ...
(ReDoS) attacks.
The hardware that performs the acceleration may be part of a general-purpose CPU, or a separate unit called a hardware accelerator, though they are usually referred to with a more specific term, such as 3D accelerator, or
cryptographic accelerator.
Traditionally, processors were sequential (instructions are executed one by one), and were designed to run general purpose algorithms controlled by
instruction fetch (for example, moving temporary results
to and from a
register file). Hardware accelerators improve the execution of a specific algorithm by allowing greater
concurrency, having specific
datapaths for their
temporary variables, and reducing the overhead of instruction control in the fetch-decode-execute cycle.
Modern processors are
multi-core
A multi-core processor (MCP) is a microprocessor on a single integrated circuit (IC) with two or more separate central processing units (CPUs), called ''cores'' to emphasize their multiplicity (for example, ''dual-core'' or ''quad-core''). Ea ...
and often feature parallel "single-instruction; multiple data" (
SIMD
Single instruction, multiple data (SIMD) is a type of parallel computer, parallel processing in Flynn's taxonomy. SIMD describes computers with multiple processing elements that perform the same operation on multiple data points simultaneousl ...
) units. Even so, hardware acceleration still yields benefits. Hardware acceleration is suitable for any computation-intensive algorithm which is executed frequently in a task or program. Depending upon the granularity, hardware acceleration can vary from a small functional unit, to a large functional block (like
motion estimation
In computer vision and image processing, motion estimation is the process of determining ''motion vectors'' that describe the transformation from one 2D image to another; usually from adjacent video frame, frames in a video sequence. It is an wel ...
in
MPEG-2
MPEG-2 (a.k.a. H.222/H.262 as was defined by the ITU) is a standard for "the generic coding of moving pictures and associated audio information". It describes a combination of lossy video compression and lossy audio data compression methods ...
).
Hardware acceleration units by application
See also
*
Coprocessor
A coprocessor is a computer processor used to supplement the functions of the primary processor (the CPU). Operations performed by the coprocessor may be floating-point arithmetic, graphics, signal processing, string processing, cryptography or ...
*
DirectX Video Acceleration (DXVA)
*
Direct memory access
Direct memory access (DMA) is a feature of computer systems that allows certain hardware subsystems to access main system computer memory, memory independently of the central processing unit (CPU).
Without DMA, when the CPU is using programmed i ...
(DMA)
*
High-level synthesis
High-level synthesis (HLS), sometimes referred to as C synthesis, electronic system-level (ESL) synthesis, algorithmic synthesis, or behavioral synthesis, is an automated design process that takes an abstract behavioral specification of a digital ...
**
C to HDL
C to HDL tools convert C language or C-like computer code into a hardware description language (HDL) such as VHDL or Verilog. The converted code can then be synthesized and translated into a hardware device such as a field-programmable gate ...
**
Flow to HDL
*
Soft microprocessor
A soft microprocessor (also called softcore microprocessor or a soft processor) is a microprocessor core that can be wholly implemented using logic synthesis. It can be implemented via different semiconductor devices containing programmable logic ...
*
Flynn's taxonomy
Flynn's taxonomy is a classification of computer architectures, proposed by Michael J. Flynn in 1966 and extended in 1972. The classification system has stuck, and it has been used as a tool in the design of modern processors and their functionalit ...
of parallel
computer architectures
**
Single instruction, multiple data
Single instruction, multiple data (SIMD) is a type of parallel computer, parallel processing in Flynn's taxonomy. SIMD describes computers with multiple processing elements that perform the same operation on multiple data points simultaneousl ...
(SIMD)
**
Single instruction, multiple threads (SIMT)
**
Multiple instructions, multiple data (MIMD)
*
Computer for operations with functions
References
External links
*
{{Digital electronics
Application-specific integrated circuits
Central processing unit
Computer optimization
Gate arrays
Graphics hardware
Articles with example C code