Multiply–accumulate operation
   HOME

TheInfoList



OR:

In
computing Computing is any goal-oriented activity requiring, benefiting from, or creating computing machinery. It includes the study and experimentation of algorithmic processes, and development of both hardware and software. Computing has scientific, ...
, especially digital signal processing, the multiply–accumulate (MAC) or multiply-add (MAD) operation is a common step that computes the product of two numbers and adds that product to an accumulator. The hardware unit that performs the operation is known as a multiplier–accumulator (MAC unit); the operation itself is also often called a MAC or a MAD operation. The MAC operation modifies an accumulator ''a'': :\ a \leftarrow a + ( b \times c ) When done with
floating point In computing, floating-point arithmetic (FP) is arithmetic that represents real numbers approximately, using an integer with a fixed precision, called the significand, scaled by an integer exponent of a fixed base. For example, 12.345 can b ...
numbers, it might be performed with two
rounding Rounding means replacing a number with an approximate value that has a shorter, simpler, or more explicit representation. For example, replacing $ with $, the fraction 312/937 with 1/3, or the expression with . Rounding is often done to ob ...
s (typical in many DSPs), or with a single rounding. When performed with a single rounding, it is called a fused multiply–add (FMA) or fused multiply–accumulate (FMAC). Modern computers may contain a dedicated MAC, consisting of a multiplier implemented in
combinational logic In automata theory, combinational logic (also referred to as time-independent logic or combinatorial logic) is a type of digital logic which is implemented by Boolean circuits, where the output is a pure function of the present input only. This ...
followed by an adder and an accumulator register that stores the result. The output of the register is fed back to one input of the adder, so that on each clock cycle, the output of the multiplier is added to the register. Combinational multipliers require a large amount of logic, but can compute a product much more quickly than the method of shifting and adding typical of earlier computers.
Percy Ludgate Percy Edwin Ludgate (2 August 1883 – 16 October 1922) was an Irish amateur scientist who designed the second analytical engine (general-purpose Turing-complete computer) in history. Life Ludgate was born on 2 August 1883 in Skibbereen, ...
was the first to conceive a MAC in his Analytical Machine of 1909, and the first to exploit a MAC for division (using multiplication seeded by reciprocal, via the convergent series ). The first modern processors to be equipped with MAC units were digital signal processors, but the technique is now also common in general-purpose processors.


In floating-point arithmetic

When done with
integer An integer is the number zero (), a positive natural number (, , , etc.) or a negative integer with a minus sign ( −1, −2, −3, etc.). The negative numbers are the additive inverses of the corresponding positive numbers. In the languag ...
s, the operation is typically exact (computed modulo some
power of two A power of two is a number of the form where is an integer, that is, the result of exponentiation with number two as the base and integer  as the exponent. In a context where only integers are considered, is restricted to non-negativ ...
). However, floating-point numbers have only a certain amount of mathematical
precision Precision, precise or precisely may refer to: Science, and technology, and mathematics Mathematics and computing (general) * Accuracy and precision, measurement deviation from true value and its scatter * Significant figures, the number of digit ...
. That is, digital floating-point arithmetic is generally not associative or distributive. (See .) Therefore, it makes a difference to the result whether the multiply–add is performed with two roundings, or in one operation with a single rounding (a fused multiply–add).
IEEE 754-2008 The Institute of Electrical and Electronics Engineers (IEEE) is a 501(c)(3) professional association for electronic engineering and electrical engineering (and associated disciplines) with its corporate office in New York City and its operation ...
specifies that it must be performed with one rounding, yielding a more accurate result.


Fused multiply–add

A fused multiply–add (FMA or fmadd) is a floating-point multiply–add operation performed in one step, with a single rounding. That is, where an unfused multiply–add would compute the product , round it to ''N'' significant bits, add the result to ''a'', and round back to ''N'' significant bits, a fused multiply–add would compute the entire expression to its full precision before rounding the final result down to ''N'' significant bits. A fast FMA can speed up and improve the accuracy of many computations that involve the accumulation of products: *
Dot product In mathematics, the dot product or scalar productThe term ''scalar product'' means literally "product with a scalar as a result". It is also used sometimes for other symmetric bilinear forms, for example in a pseudo-Euclidean space. is an alge ...
*
Matrix multiplication In mathematics, particularly in linear algebra, matrix multiplication is a binary operation that produces a matrix from two matrices. For matrix multiplication, the number of columns in the first matrix must be equal to the number of rows in the s ...
* Polynomial evaluation (e.g., with
Horner's rule In mathematics and computer science, Horner's method (or Horner's scheme) is an algorithm for polynomial evaluation. Although named after William George Horner, this method is much older, as it has been attributed to Joseph-Louis Lagrange by Horn ...
) * Newton's method for evaluating functions (from the inverse function) * Convolutions and artificial neural networks * Multiplication in
double-double arithmetic In computing, quadruple precision (or quad precision) is a binary floating point–based computer number format that occupies 16 bytes (128 bits) with precision at least twice the 53-bit double precision. This 128-bit quadruple precision is desi ...
Fused multiply–add can usually be relied on to give more accurate results. However, William Kahan has pointed out that it can give problems if used unthinkingly. If is evaluated as (following Kahan's suggested notation in which redundant parentheses direct the compiler to round the term first) using fused multiply–add, then the result may be negative even when due to the first multiplication discarding low significance bits. This could then lead to an error if, for instance, the square root of the result is then evaluated. When implemented inside a
microprocessor A microprocessor is a computer processor where the data processing logic and control is included on a single integrated circuit, or a small number of integrated circuits. The microprocessor contains the arithmetic, logic, and control circ ...
, an FMA can be faster than a multiply operation followed by an add. However, standard industrial implementations based on the original IBM RS/6000 design require a 2''N''-bit adder to compute the sum properly. Another benefit of including this instruction is that it allows an efficient software implementation of
division Division or divider may refer to: Mathematics *Division (mathematics), the inverse of multiplication *Division algorithm, a method for computing the result of mathematical division Military *Division (military), a formation typically consisting ...
(see
division algorithm A division algorithm is an algorithm which, given two integers N and D, computes their quotient and/or remainder, the result of Euclidean division. Some are applied by hand, while others are employed by digital circuit designs and software. Div ...
) and
square root In mathematics, a square root of a number is a number such that ; in other words, a number whose ''square'' (the result of multiplying the number by itself, or  ⋅ ) is . For example, 4 and −4 are square roots of 16, because . ...
(see
methods of computing square roots Methods of computing square roots are numerical analysis algorithms for approximating the principal, or non-negative, square root (usually denoted \sqrt, \sqrt /math>, or S^) of a real number. Arithmetically, it means given S, a procedure for fi ...
) operations, thus eliminating the need for dedicated hardware for those operations.


Dot product instruction

Some machines combine multiple fused multiply add operations into a single step, e.g. performing a four-element dot-product on two 128-bit
SIMD Single instruction, multiple data (SIMD) is a type of parallel processing in Flynn's taxonomy. SIMD can be internal (part of the hardware design) and it can be directly accessible through an instruction set architecture (ISA), but it shoul ...
registers a0×b0 + a1×b1 + a2×b2 + a3×b3 with single cycle throughput.


Support

The FMA operation is included in
IEEE 754-2008 The Institute of Electrical and Electronics Engineers (IEEE) is a 501(c)(3) professional association for electronic engineering and electrical engineering (and associated disciplines) with its corporate office in New York City and its operation ...
. The
Digital Equipment Corporation Digital Equipment Corporation (DEC ), using the trademark Digital, was a major American company in the computer industry from the 1960s to the 1990s. The company was co-founded by Ken Olsen and Harlan Anderson in 1957. Olsen was president un ...
(DEC)
VAX VAX (an acronym for Virtual Address eXtension) is a series of computers featuring a 32-bit instruction set architecture (ISA) and virtual memory that was developed and sold by Digital Equipment Corporation (DEC) in the late 20th century. The V ...
's POLY instruction is used for evaluating polynomials with
Horner's rule In mathematics and computer science, Horner's method (or Horner's scheme) is an algorithm for polynomial evaluation. Although named after William George Horner, this method is much older, as it has been attributed to Joseph-Louis Lagrange by Horn ...
using a succession of multiply and add steps. Instruction descriptions do not specify whether the multiply and add are performed using a single FMA step. This instruction has been a part of the VAX instruction set since its original 11/780 implementation in 1977. The 1999 standard of the C programming language supports the FMA operation through the fma() standard math library function and the automatic transformation of a multiplication followed by an addition (contraction of floating-point expressions), which can be explicitly enabled or disabled with standard pragmas (). The GCC and
Clang Clang is a compiler front end for the C, C++, Objective-C, and Objective-C++ programming languages, as well as the OpenMP, OpenCL, RenderScript, CUDA, and HIP frameworks. It acts as a drop-in replacement for the GNU Compiler Collection ...
C compilers do such transformations by default for processor architectures that support FMA instructions. With GCC, which does not support the aforementioned pragma, this can be globally controlled by the -ffp-contract command line option. The fused multiply–add operation was introduced as "multiply–add fused" in the IBM
POWER1 The POWER1 is a multi-chip CPU developed and fabricated by IBM that implemented the POWER instruction set architecture (ISA). It was originally known as the RISC System/6000 CPU or, when in an abbreviated form, the RS/6000 CPU, before introdu ...
(1990) processor, but has been added to numerous other processors since then: * HP PA-8000 (1996) and above * Hitachi SuperH SH-4 (1998) *
SCE SCE is an abbreviation with multiple meanings: Science * Short-channel effect, a secondary effect describing the reduction in threshold voltage Vth in MOSFETs with non-uniformly doped channel regions as the gate length increases * Saturated calomel ...
-
Toshiba , commonly known as Toshiba and stylized as TOSHIBA, is a Japanese multinational conglomerate corporation headquartered in Minato, Tokyo, Japan. Its diversified products and services include power, industrial and social infrastructure systems, ...
Emotion Engine (1999) * Intel
Itanium Itanium ( ) is a discontinued family of 64-bit Intel microprocessors that implement the Intel Itanium architecture (formerly called IA-64). Launched in June 2001, Intel marketed the processors for enterprise servers and high-performance comput ...
(2001) * STI
Cell Cell most often refers to: * Cell (biology), the functional basic unit of life Cell may also refer to: Locations * Monastic cell, a small room, hut, or cave in which a religious recluse lives, alternatively the small precursor of a monastery ...
(2006) * Fujitsu SPARC64 VI (2007) and above * ( MIPS-compatible)
Loongson Loongson () is the name of a family of general-purpose, MIPS architecture-compatible microprocessors, as well as the name of the Chinese fabless company (Loongson Technology) that develops them. The processors are alternately called Godson pro ...
-2F (2008) * Elbrus-8SV (2018) * x86 processors with FMA3 and/or FMA4 instruction set ** AMD Bulldozer (2011, FMA4 only) ** AMD Piledriver (2012, FMA3 and FMA4) ** AMD Steamroller (2014) ** AMD Excavator (2015) ** AMD
Zen Zen ( zh, t=禪, p=Chán; ja, text= 禅, translit=zen; ko, text=선, translit=Seon; vi, text=Thiền) is a school of Mahayana Buddhism that originated in China during the Tang dynasty, known as the Chan School (''Chánzong'' 禪宗), and ...
(2017, FMA3 only) ** Intel Haswell (2013, FMA3 only) ** Intel Skylake (2015, FMA3 only) * ARM processors with VFPv4 and/or NEONv2: ** ARM Cortex-M4F (2010) **
ARM Cortex-A5 The ARM Cortex-A5 is a 32-bit processor core licensed by ARM Holdings implementing the ARMv7-A architecture announced in 2009. Overview The Cortex-A5 is intended to replace the ARM9 and ARM11 cores for use in low-end devices. The Cortex-A5 offe ...
(2012) **
ARM Cortex-A7 The ARM Cortex-A7 MPCore is a 32-bit microprocessor core licensed by ARM Holdings implementing the ARMv7-A architecture announced in 2011. Overview It has two target applications; firstly as a smaller, simpler, and more power-efficient success ...
(2013) ** ARM Cortex-A15 (2012) ** Qualcomm Krait (2012) **
Apple A6 The Apple A6 is a 32-bit package on package (PoP) system on a chip (SoC) designed by Apple Inc. that was introduced on September 12, 2012 at the launch of the iPhone 5. Apple states that it is up to twice as fast and has up to twice the graphic ...
(2012) ** All
ARMv8 ARM (stylised in lowercase as arm, formerly an acronym for Advanced RISC Machines and originally Acorn RISC Machine) is a family of reduced instruction set computer (RISC) instruction set architectures for computer processors, configured ...
processors *** Fujitsu A64FX has "Four-operand FMA with Prefix Instruction". * IBM z/Architecture (since 1998) * GPUs and GPGPU boards: ** AMD GPUs (2009) and newer *** TeraScale 2 "Evergreen"-series based ***
Graphics Core Next Graphics Core Next (GCN) is the codename for a series of microarchitectures and an instruction set architecture that were developed by AMD for its GPUs as the successor to its TeraScale microarchitecture. The first product featuring GCN was la ...
-based ** Nvidia GPUs (2010) and newer *** Fermi-based (2010) *** Kepler-based (2012) *** Maxwell-based (2014) *** Pascal-based (2016) *** Volta-based (2017) ** Intel GPUs since
Sandy Bridge Sandy Bridge is the codename for Intel's 32 nm microarchitecture used in the second generation of the Intel Core processors (Core i7, i5, i3). The Sandy Bridge microarchitecture is the successor to Nehalem and Westmere microarchitecture. ...
** Intel MIC (2012) ** ARM Mali T600 Series (2012) and above * Vector Processors: **
NEC SX-Aurora TSUBASA The NEC SX-Aurora TSUBASA is a vector processor of the NEC SX architecture family. Unlike previous SX supercomputers, the SX-Aurora TSUBASA is provided as a PCIe card, termed by NEC as a "Vector Engine" (VE). Eight VE cards can be inserted into a v ...
*
RISC-V RISC-V (pronounced "risk-five" where five refers to the number of generations of RISC architecture that were developed at the University of California, Berkeley since 1981) is an open standard instruction set architecture (ISA) based on estab ...
instruction set (2010)


References

{{DEFAULTSORT:Multiply-accumulate operation Computer arithmetic Digital signal processing