X86 SIMD Instruction Listings

The x86 instruction set has several times been extended with SIMD (single instruction, multiple data) instruction set extensions. These extensions, starting from the MMX instruction set extension introduced with Pentium MMX in 1997, typically define sets of wide registers and instructions that subdivide these registers into fixed-size lanes and perform a computation for each lane in parallel.
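The lane model described above can be illustrated with a small scalar reference model. This is only a sketch: the function names are hypothetical, a register is represented as a plain Python integer, and the 64-bit register with 16-bit lanes is an arbitrary choice for illustration.

```python
def split_lanes(value, lane_bits, total_bits):
    """Split a register (held as one integer) into lanes; lane 0 is the lowest bits."""
    mask = (1 << lane_bits) - 1
    return [(value >> shift) & mask for shift in range(0, total_bits, lane_bits)]


def join_lanes(lanes, lane_bits):
    """Pack a list of lanes back into one integer, lane 0 in the lowest bits."""
    mask = (1 << lane_bits) - 1
    value = 0
    for i, lane in enumerate(lanes):
        value |= (lane & mask) << (i * lane_bits)
    return value


def packed_add(a, b, lane_bits=16, total_bits=64):
    """Add two registers lane by lane; a carry never crosses a lane boundary."""
    mask = (1 << lane_bits) - 1
    lanes_a = split_lanes(a, lane_bits, total_bits)
    lanes_b = split_lanes(b, lane_bits, total_bits)
    return join_lanes([(x + y) & mask for x, y in zip(lanes_a, lanes_b)], lane_bits)
```

In this model, a lane holding 0xFFFF wraps to 0x0000 when 1 is added, while the neighbouring lanes are unaffected.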


Summary of SIMD extensions

The main SIMD instruction set extensions that have been introduced for x86 are:


MMX instructions and extended variants thereof

These instructions are, unless otherwise noted, available in the following forms:

* MMX: 64-bit vectors, operating on mm0..mm7 registers (aliased on top of the old x87 register file)
* SSE2: 128-bit vectors, operating on xmm0..xmm15 registers (xmm0..xmm7 in 32-bit mode)
* AVX: 128-bit vectors, operating on xmm0..xmm15 registers, with a new three-operand encoding enabled by the new VEX prefix. (AVX introduced 256-bit vector registers, but the full width of these vectors was in general not made available for integer SIMD instructions until AVX2.)
* AVX2: 256-bit vectors, operating on ymm0..ymm15 registers (extended versions of the xmm0..xmm15 registers)
* AVX-512: 512-bit vectors, operating on zmm0..zmm31 registers (zmm0..zmm15 are extended versions of the ymm0..ymm15 registers, while zmm16..zmm31 are new to AVX-512). AVX-512 also introduces opmasks, allowing the operation of most instructions to be masked on a per-lane basis by an opmask register (the lane width varies from one instruction to another). AVX-512 also adds broadcast functionality for many of its instructions: this is used with memory source arguments to replicate a single value to all lanes of a vector calculation.

The tables below indicate whether opmasks and broadcasts are supported for each instruction and, if so, which lane widths they use.

For many of the instruction mnemonics, (V) is used to indicate that the mnemonic exists in forms with and without a leading V. The form with the leading V is used for the VEX/EVEX-prefixed instruction variants introduced by AVX/AVX2/AVX-512, while the form without the leading V is used for legacy MMX/SSE encodings without a VEX/EVEX prefix.
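The opmask and broadcast behaviour described for AVX-512 can be sketched with a scalar model. The function name and the list-of-lanes representation are hypothetical. The distinction between keeping the old destination lane and zeroing it for masked-off lanes is not covered in the text above; it is included here as an assumption about how EVEX masking behaves.

```python
def masked_add_broadcast(dst, src, scalar, opmask, zeroing=False):
    """Model of a masked lane-wise add with a broadcast second operand.

    dst, src: lists of lane values (lane 0 first).
    scalar:   a single value, replicated to every lane (the broadcast).
    opmask:   integer with one bit per lane; bit i enables lane i.
    zeroing:  assumption: if True, disabled lanes become 0; otherwise
              they keep the old destination value.
    """
    broadcast = [scalar] * len(src)
    result = []
    for i, (old, a, b) in enumerate(zip(dst, src, broadcast)):
        if (opmask >> i) & 1:
            result.append(a + b)      # lane enabled: perform the operation
        elif zeroing:
            result.append(0)          # lane disabled, zeroing behaviour
        else:
            result.append(old)        # lane disabled, old value preserved
    return result
```

With a mask of 0b0101, only lanes 0 and 2 receive new results.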


Original Pentium MMX instructions, and SSE2/AVX/AVX-512 extended variants thereof


MMX instructions added with MMX+/SSE/SSE2/SSSE3, and SSE2/AVX/AVX-512 extended variants thereof


SSE instructions and extended variants thereof


Regularly-encoded floating-point SSE/SSE2 instructions, and AVX/AVX-512 extended variants thereof

For the instructions in the table below, the following considerations apply unless otherwise noted:

* Packed instructions are available at all vector lengths (128-bit for SSE2, 128/256-bit for AVX, 128/256/512-bit for AVX-512)
* FP32 variants of instructions are introduced as part of SSE. FP64 variants of instructions are introduced as part of SSE2.
* The AVX-512 variants of the FP32 and FP64 instructions are introduced as part of the AVX512F subset.
* For AVX-512 variants of the instructions, opmasks and broadcasts are available with a width of 32 bits for FP32 operations and 64 bits for FP64 operations. (Broadcasts are available for vector operations only.)

From SSE2 onwards, some data movement/bitwise instructions exist in three forms: an integer form, an FP32 form and an FP64 form. Such instructions are functionally identical; however, some processors with SSE2 implement integer, FP32 and FP64 execution units as three different execution clusters, where forwarding of results from one cluster to another may come with performance penalties, and where such penalties can be minimized by choosing instruction forms appropriately. (For example, there exist three forms of vector bitwise XOR instructions under SSE2, namely PXOR, XORPS, and XORPD, which are intended for use on integer, FP32, and FP64 data, respectively.)
In the list below, x denotes an XMM register. Each entry gives the opcode, the operation and the instruction forms; a trailing RC or SAE marks operations whose AVX-512 variants support embedded rounding control or suppress-all-exceptions, respectively.

* Load 64 bits from memory into the lower half of XMM register: MOVLPS x,m64; MOVLPD x,m64
* 0F 13 /r: Store 64 bits to memory from lower half of XMM register: MOVLPS m64,x; MOVLPD m64,x
* 0F 14 /r: Unpack and interleave low-order floating-point values
* 0F 15 /r: Unpack and interleave high-order floating-point values
* 0F 16 /r: Load 64 bits from memory or lower half of XMM register into the upper half of XMM register while keeping the lower half unchanged: MOVLHPS x,x; MOVHPS x,m64; MOVHPD x,m64
* 0F 17 /r: Store 64 bits to memory from upper half of XMM register: MOVHPS m64,x; MOVHPD m64,x

* Aligned load from memory or vector register: MOVAPS x,x/m128; MOVAPD x,x/m128
* 0F 29 /r: Aligned store to memory or vector register: MOVAPS x/m128,x; MOVAPD x/m128,x
* 0F 2A /r: Integer to floating-point conversion using general registers, MMX registers or memory as source; RC
* 0F 2B /r: Non-temporal store to memory from vector register. The packed variants require aligned memory addresses even in VEX/EVEX-encoded forms: MOVNTPS m128,x; MOVNTPD m128,x
* 0F 2C /r: Floating-point to integer conversion with truncation, using general-purpose registers or MMX registers as destination; SAE
* 0F 2D /r: Floating-point to integer conversion, using general-purpose registers or MMX registers as destination; RC
* 0F 2E /r: Unordered compare floating-point values and set EFLAGS. Compares the bottom lanes of xmm vector registers: UCOMISS x,x/m32; UCOMISD x,x/m64; SAE
* 0F 2F /r: Compare floating-point values and set EFLAGS. Compares the bottom lanes of xmm vector registers: COMISS x,x/m32; COMISD x,x/m64; SAE

* 0F 50 /r: Extract packed floating-point sign mask
* 0F 51 /r: Floating-point Square Root: SQRTPS x,x/m128; SQRTSS x,x/m32; SQRTPD x,x/m128; SQRTSD x,x/m64; RC
* 0F 52 /r: Reciprocal Square Root Approximation
* 0F 53 /r: Reciprocal Approximation: RCPPS x,x/m128; RCPSS x,x/m32
* 0F 54 /r: Vector bitwise AND: ANDPS x,x/m128; ANDPD x,x/m128
* 0F 55 /r: Vector bitwise AND-NOT: ANDNPS x,x/m128; ANDNPD x,x/m128
* 0F 56 /r: Vector bitwise OR: ORPS x,x/m128; ORPD x,x/m128
* 0F 57 /r: Vector bitwise XOR: XORPS x,x/m128; XORPD x,x/m128

* 0F 58 /r: Floating-point Add: ADDPS x,x/m128; ADDSS x,x/m32; ADDPD x,x/m128; ADDSD x,x/m64; RC
* 0F 59 /r: Floating-point Multiply: MULPS x,x/m128; MULSS x,x/m32; MULPD x,x/m128; MULSD x,x/m64; RC
* 0F 5A /r: Convert between floating-point formats (FP32→FP64, FP64→FP32). All forms require SSE2, including those that take FP32 input; SAE, RC
* 0F 5C /r: Floating-point Subtract: SUBPS x,x/m128; SUBSS x,x/m32; SUBPD x,x/m128; SUBSD x,x/m64; RC
* 0F 5D /r: Floating-point Minimum Value: MINPS x,x/m128; MINSS x,x/m32; MINPD x,x/m128; MINSD x,x/m64; SAE
* 0F 5E /r: Floating-point Divide: DIVPS x,x/m128; DIVSS x,x/m32; DIVPD x,x/m128; DIVSD x,x/m64; RC
* 0F 5F /r: Floating-point Maximum Value: MAXPS x,x/m128; MAXSS x,x/m32; MAXPD x,x/m128; MAXSD x,x/m64; SAE

* Floating-point compare. The result is written as all-0s/all-1s values (all-1s for comparison true) to vector registers for SSE/AVX, but to an opmask register for AVX-512. The comparison function is specified by the imm8 argument; SAE
* Packed Interleaved Shuffle. Performs a shuffle on each of its two input arguments, then keeps the bottom half of the shuffle result from its first argument and the top half of the shuffle result from its second argument.
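The packed interleaved shuffle described above can be sketched for a 4-lane vector. The function names are hypothetical, and the imm8 layout used here (two index bits per result lane, lowest lane in the lowest bits) is an assumption modelled on the 4-lane FP32 form; it is not stated in the text above.

```python
def shuffle4(src, imm8):
    """Shuffle a 4-lane vector: result lane i takes the source lane
    selected by the two imm8 bits at positions 2*i+1 and 2*i."""
    return [src[(imm8 >> (2 * i)) & 3] for i in range(4)]


def interleaved_shuffle(a, b, imm8):
    """Shuffle both inputs with the same imm8, then combine the halves."""
    low = shuffle4(a, imm8)[:2]    # bottom half comes from the first argument
    high = shuffle4(b, imm8)[2:]   # top half comes from the second argument
    return low + high
```

With imm8 = 0xE4 (indices 0, 1, 2, 3) the result is simply the low half of `a` followed by the high half of `b`.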


Integer SSE2/4 instructions with 66h prefix, and AVX/AVX-512 extended variants thereof

These instructions do not have any MMX forms, and do not support any encodings without a prefix. Most of these instructions have extended variants available in VEX-encoded and EVEX-encoded forms:

* The VEX-encoded forms are available under AVX/AVX2. Under AVX, they are available only with a vector length of 128 bits (VEX.L=0 encoding). Under AVX2, they are (with some exceptions noted with "L=0") also made available with a vector length of 256 bits.
* The EVEX-encoded forms are available under AVX-512. The specific AVX-512 subset needed for each instruction is listed along with the instruction.

Each entry below gives the operation, the instruction, its opcode and, for the EVEX-encoded form, the required AVX-512 subset followed by the opmask and broadcast lane widths where supported.

* Sign-extend packed integers, 8-bit → 16-bit: (V)PMOVSXBW xmm,xmm/m64; 0F38 20 /r; AVX-512 BW, opmask 16
* Sign-extend packed integers, 8-bit → 32-bit: (V)PMOVSXBD xmm,xmm/m32; 0F38 21 /r; AVX-512 F, opmask 32
* Sign-extend packed integers, 8-bit → 64-bit: (V)PMOVSXBQ xmm,xmm/m16; 0F38 22 /r; AVX-512 F, opmask 64
* Sign-extend packed integers, 16-bit → 32-bit: (V)PMOVSXWD xmm,xmm/m64; 0F38 23 /r; AVX-512 F, opmask 32
* Sign-extend packed integers, 16-bit → 64-bit: (V)PMOVSXWQ xmm,xmm/m32; 0F38 24 /r; AVX-512 F, opmask 64
* Sign-extend packed integers, 32-bit → 64-bit: (V)PMOVSXDQ xmm,xmm/m64; 0F38 25 /r; AVX-512 F, opmask 64

* Multiply packed 32-bit signed integers, store full 64-bit result. The input integers are taken from the low 32 bits of each 64-bit vector lane: (V)PMULDQ xmm,xmm/m128; 0F38 28 /r; AVX-512 F, opmask 64, broadcast 64
* Compare packed 64-bit integers for equality: (V)PCMPEQQ xmm,xmm/m128; 0F38 29 /r; AVX-512 F, opmask 64, broadcast 64
* Aligned non-temporal vector load from memory: (V)MOVNTDQA xmm,m128; 0F38 2A /r; AVX-512 F
* Pack 32-bit unsigned integers to 16-bit, with saturation: 0F38 2B /r; AVX-512 BW, opmask 16, broadcast 32

* Zero-extend packed integers, 8-bit → 16-bit: (V)PMOVZXBW xmm,xmm/m64; 0F38 30 /r; AVX-512 BW, opmask 16
* Zero-extend packed integers, 8-bit → 32-bit: (V)PMOVZXBD xmm,xmm/m32; 0F38 31 /r; AVX-512 F, opmask 32
* Zero-extend packed integers, 8-bit → 64-bit: (V)PMOVZXBQ xmm,xmm/m16; 0F38 32 /r; AVX-512 F, opmask 64
* Zero-extend packed integers, 16-bit → 32-bit: (V)PMOVZXWD xmm,xmm/m64; 0F38 33 /r; AVX-512 F, opmask 32
* Zero-extend packed integers, 16-bit → 64-bit: (V)PMOVZXWQ xmm,xmm/m32; 0F38 34 /r; AVX-512 F, opmask 64
* Zero-extend packed integers, 32-bit → 64-bit: (V)PMOVZXDQ xmm,xmm/m64; 0F38 35 /r; AVX-512 F, opmask 64

* Packed minimum-value of signed integers, 8-bit: (V)PMINSB xmm,xmm/m128; 0F38 38 /r; AVX-512 BW, opmask 8
* Packed minimum-value of signed integers, 32-bit: (V)PMINSD xmm,xmm/m128; 0F38 39 /r; AVX-512 F, opmask 32, broadcast 32
* Packed minimum-value of signed integers, 64-bit: VPMINSQ xmm,xmm/m128 (AVX-512); 0F38 39 /r; AVX-512 F, opmask 64, broadcast 64
* Packed minimum-value of unsigned integers, 16-bit: (V)PMINUW xmm,xmm/m128; 0F38 3A /r; AVX-512 BW, opmask 16
* Packed minimum-value of unsigned integers, 32-bit: (V)PMINUD xmm,xmm/m128; 0F38 3B /r; AVX-512 F, opmask 32, broadcast 32
* Packed minimum-value of unsigned integers, 64-bit: VPMINUQ xmm,xmm/m128 (AVX-512); 0F38 3B /r; AVX-512 F, opmask 64, broadcast 64
* Packed maximum-value of signed integers, 8-bit: (V)PMAXSB xmm,xmm/m128; 0F38 3C /r; AVX-512 BW, opmask 8
* Packed maximum-value of signed integers, 32-bit: (V)PMAXSD xmm,xmm/m128; 0F38 3D /r; AVX-512 F, opmask 32, broadcast 32
* Packed maximum-value of signed integers, 64-bit: VPMAXSQ xmm,xmm/m128 (AVX-512); 0F38 3D /r; AVX-512 F, opmask 64, broadcast 64
* Packed maximum-value of unsigned integers, 16-bit: (V)PMAXUW xmm,xmm/m128; 0F38 3E /r; AVX-512 BW, opmask 16
* Packed maximum-value of unsigned integers, 32-bit: (V)PMAXUD xmm,xmm/m128; 0F38 3F /r; AVX-512 F, opmask 32, broadcast 32
* Packed maximum-value of unsigned integers, 64-bit: VPMAXUQ xmm,xmm/m128 (AVX-512); 0F38 3F /r; AVX-512 F, opmask 64, broadcast 64

* Multiply packed 32/64-bit integers, store low half of results: (V)PMULLD xmm,xmm/m128; VPMULLQ xmm,xmm/m128 (AVX-512); 0F38 40 /r; AVX-512 F, opmask 32, broadcast 32 for the 32-bit form; AVX-512 DQ, opmask 64, broadcast 64 for the 64-bit form
* Packed Horizontal Word Minimum. Find the smallest 16-bit integer in a packed vector of 16-bit unsigned integers, then return the integer and its index in the bottom two 16-bit lanes of the result vector: 0F38 41 /r
* Blend Packed Words. For each 16-bit lane of the result, pick a 16-bit value from either the first or the second source argument depending on the corresponding bit of the imm8.

* Extract integer from indexed lane of vector register, and store to GPR or memory (zero-extended if stored to GPR), 8-bit: 0F3A 14 /r ib; AVX-512 BW
* Extract integer from indexed lane, 16-bit: 0F3A 15 /r ib; AVX-512 BW
* Extract integer from indexed lane, 32-bit: (V)PEXTRD r/m32,xmm,imm8; 0F3A 16 /r ib; AVX-512 DQ
* Extract integer from indexed lane, 64-bit (x86-64): 0F3A 16 /r ib; AVX-512 DQ
* Insert integer from general-purpose register into indexed lane of vector register, 8-bit: (V)PINSRB xmm,r32/m8,imm8; 0F3A 20 /r ib; AVX-512 BW
* Insert integer into indexed lane, 32-bit: (V)PINSRD xmm,r32/m32,imm8; 0F3A 22 /r ib; AVX-512 DQ
* Insert integer into indexed lane, 64-bit (x86-64): 0F3A 22 /r ib; AVX-512 DQ

* Compute Multiple Packed Sums of Absolute Difference. The 128-bit form of this instruction computes 8 sums of absolute differences from sequentially selected groups of four bytes in the first source argument and a selected group of four contiguous bytes in the second source operand, and writes the sums to sequential 16-bit lanes of the destination register. If the two source arguments src1 and src2 are considered to be two 16-entry arrays of uint8 values and temp is considered to be an 8-entry array of uint16 values, then the operation of the instruction is:

for i = 0 to 7 do
    temp[i] := 0
    for j = 0 to 3 do
        a := src1[ i+(imm8[2]*4)+j ]
        b := src2[ (imm8[1:0]*4)+j ]
        temp[i] := temp[i] + abs(a-b)
    done
done
dst := temp
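The pseudocode above can be turned into a runnable scalar model. This is a sketch under stated assumptions: the function name is hypothetical, and bit 2 of the imm8 is assumed to select a byte offset of 0 or 4 in src1 while bits 1:0 select one of four 4-byte groups in src2.

```python
def mpsadbw_128(src1, src2, imm8):
    """Scalar model of the 128-bit form.

    src1, src2: 16-entry lists of uint8 values.
    Returns an 8-entry list of sums (each fits in a uint16).
    """
    if len(src1) != 16 or len(src2) != 16:
        raise ValueError("both sources must hold 16 bytes")
    off1 = ((imm8 >> 2) & 1) * 4   # imm8 bit 2: start offset in src1 (0 or 4)
    off2 = (imm8 & 3) * 4          # imm8 bits 1:0: 4-byte group in src2
    temp = []
    for i in range(8):
        total = 0
        for j in range(4):
            a = src1[i + off1 + j]  # sliding 4-byte window in src1
            b = src2[off2 + j]      # fixed 4-byte group in src2
            total += abs(a - b)
        temp.append(total)
    return temp
```

The largest index read from src1 is 7 + 4 + 3 = 14, so the sliding window always stays within the 16 source bytes.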
For wider forms of this instruction under AVX2 and AVX10.2, the operation is split into 128-bit lanes, where each lane internally performs the same operation as the 128-bit variant of the instruction, except that odd-numbered lanes use bits 5:3 rather than bits 2:0 of the imm8. The EVEX-encoded form requires AVX10.2 and supports opmasks with a lane width of 16 bits.

Added with SSE 4.2:

* Compare packed 64-bit signed integers for greater-than: (V)PCMPGTQ xmm,xmm/m128; 0F38 37 /r; AVX-512 F, opmask 64, broadcast 64
* Packed Compare Explicit Length Strings, Return Mask: 0F3A 60 /r ib
* Packed Compare Explicit Length Strings, Return Index: (V)PCMPESTRI xmm,xmm/m128,imm8; 0F3A 61 /r ib
* Packed Compare Implicit Length Strings, Return Mask: (V)PCMPISTRM xmm,xmm/m128,imm8; 0F3A 62 /r ib
* Packed Compare Implicit Length Strings, Return Index: (V)PCMPISTRI xmm,xmm/m128,imm8; 0F3A 63 /r ib


Other SSE/2/3/4 SIMD instructions, and AVX/AVX-512 extended variants thereof

SSE SIMD instructions that do not fit into any of the preceding groups. Many of these instructions have AVX/AVX-512 extended forms. Unless otherwise indicated (L=0 or footnotes), these extended forms support 128/256-bit operation under AVX and 128/256/512-bit operation under AVX-512.

* BLENDVPD xmm,xmm/m128 (64-bit lanes); 66 0F38 15 /r
* Rounding of packed floating-point values to integer (rounding mode specified by imm8 argument), 32-bit: 66 0F3A 08 /r ib
* Rounding of packed floating-point values to integer, 64-bit: 66 0F3A 09 /r ib
* Rounding of scalar floating-point value to integer, 32-bit: 66 0F3A 0A /r ib
* Rounding of scalar floating-point value to integer, 64-bit: 66 0F3A 0B /r ib
* Blend packed floating-point values (32-bit and 64-bit forms). For each lane of the result, pick the value from either the first or the second argument depending on the corresponding imm8 bit. 32-bit form: (V)BLENDPS xmm,xmm/m128,imm8; 66 0F3A 0C /r ib
* Extract 32-bit lane of XMM register to general-purpose register or memory location. Bits 1:0 of the imm8 are used to select the lane: (V)EXTRACTPS r/m32,xmm,imm8; 66 0F3A 17 /r ib; AVX-512 F
* Obtain 32-bit value from source XMM register or memory, and insert into the specified lane of destination XMM register. If the source argument is an XMM register, then bits 7:6 of the imm8 are used to select which 32-bit lane to take the source from; otherwise the specified 32-bit memory value is used. This 32-bit value is then inserted into the destination register lane specified by bits 5:4 of the imm8. After insertion, each 32-bit lane of the destination register may optionally be zeroed out: bits 3:0 of the imm8 provide a bitmap of which lanes to zero out. 66 0F3A 21 /r ib; AVX-512 F
* 4-component dot-product of 32-bit floating-point values. Bits 7:4 of the imm8 specify which lanes should participate in the dot-product, and bits 3:0 specify which lanes in the result should receive the dot-product (remaining lanes are filled with zeros): 66 0F3A 40 /r ib
* 2-component dot-product of 64-bit floating-point values. Bits 5:4 of the imm8 specify which lanes should participate in the dot-product, and bits 1:0 specify which lanes in the result should receive the dot-product (remaining lanes are filled with zeros): 66 0F3A 41 /r ib

Added with SSE4a (AMD only):

* 64-bit bitfield insert, using the low 64 bits of XMM registers. The first argument is an XMM register to insert the bitfield into; the second argument is a source register containing the bitfield to insert (starting from bit 0). For the 4-argument version, the first imm8 specifies the bitfield length and the second imm8 specifies the bit-offset to insert the bitfield at. For the 2-argument version, the length and offset are instead taken from bits 69:64 and 77:72 of the second argument, respectively. Forms: INSERTQ xmm,xmm,imm8,imm8; INSERTQ xmm,xmm (F2 0F 79 /r)
* 64-bit bitfield extract, from the lower 64 bits of an XMM register. The first argument serves as both the source that the bitfield is extracted from and the destination that the bitfield is written to. For the 3-argument version, the first imm8 specifies the bitfield length and the second imm8 specifies the bitfield bit-offset. For the 2-argument version, the second argument is an XMM register that contains the bitfield length at bits 5:0 and the bit-offset at bits 13:8. Forms: EXTRQ xmm,imm8,imm8 (66 0F 78 /0 ib ib); EXTRQ xmm,xmm (66 0F 79 /r)
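The SSE4a bitfield extract and insert operations described above can be sketched on plain 64-bit integers. The function names are hypothetical. Two details are assumptions that go beyond the text: a length field of 0 is treated as a length of 64, and fields that would extend past bit 63 are rejected, since the hardware result is taken to be undefined in that case. Only the low 64 bits are modelled.

```python
MASK64 = (1 << 64) - 1


def _field(length, offset):
    """Return (mask, offset) for a bitfield; length 0 is assumed to mean 64."""
    length &= 0x3F
    offset &= 0x3F
    nbits = 64 if length == 0 else length
    if offset + nbits > 64:
        raise ValueError("bitfield extends past bit 63 (treated as undefined)")
    return (1 << nbits) - 1, offset


def extrq(low64, length, offset):
    """Extract `length` bits starting at `offset`, zero-extended."""
    mask, offset = _field(length, offset)
    return (low64 >> offset) & mask


def extrq_reg(low64, control):
    """2-argument form: length at bits 5:0, offset at bits 13:8 of `control`."""
    return extrq(low64, control & 0x3F, (control >> 8) & 0x3F)


def insertq(dst_low64, src_low64, length, offset):
    """Insert the low `length` bits of src into dst at `offset`."""
    mask, offset = _field(length, offset)
    cleared = dst_low64 & ~(mask << offset) & MASK64
    return cleared | ((src_low64 & mask) << offset)
```

For example, extracting 8 bits at offset 4 from 0x123456789ABCDEF0 yields 0xEF in this model.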


AVX

AVX was first supported by Intel with Sandy Bridge and by AMD with Bulldozer. It provides vector operations on 256-bit registers.


F16C

Half-precision floating-point conversion.


AVX2

Introduced in Intel's Haswell microarchitecture and AMD's Excavator. It expands most vector integer SSE and AVX instructions to 256 bits.


FMA3 and FMA4 instructions

Floating-point fused multiply-add instructions are introduced in x86 as two instruction set extensions, "FMA3" and "FMA4", both of which build on top of AVX to provide a set of scalar/vector instructions using the xmm/ymm/zmm vector registers. FMA3 defines a set of 3-operand fused-multiply-add instructions that take three input operands and write their result back to the first of them. FMA4 defines a set of 4-operand fused-multiply-add instructions that take four operands: a destination operand and three source operands. FMA3 is supported on Intel CPUs starting with Haswell, on AMD CPUs starting with Piledriver, and on Zhaoxin CPUs starting with YongFeng. FMA4 was only supported on AMD Family 15h (Bulldozer) CPUs and has been abandoned from AMD Zen onwards. The FMA3/FMA4 extensions are not considered to be an intrinsic part of AVX or AVX2, although all Intel and AMD (but not Zhaoxin) processors that support AVX2 also support FMA3. FMA3 instructions (in EVEX-encoded form) are, however, AVX-512 foundation instructions.
The FMA3 and FMA4 instruction sets both define a set of 10 fused-multiply-add operations, all available in FP32 and FP64 variants. For each of these variants, FMA3 defines three operand orderings while FMA4 defines two.
FMA3 encoding
FMA3 instructions are encoded with the VEX or EVEX prefixes, in the form VEX.66.0F38 xy /r or EVEX.66.0F38 xy /r. The VEX.W/EVEX.W bit selects the floating-point format (W=0 means FP32, W=1 means FP64). The opcode byte xy consists of two nibbles, where the top nibble x selects operand ordering (9='132', A='213', B='231') and the bottom nibble y (values 6..F) selects which one of the 10 fused-multiply-add operations to perform. (Values of x and y outside the given ranges result in something that is not an FMA3 instruction.)
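The opcode-byte layout described above can be sketched as a small decoder. The function name and the returned dictionary keys are hypothetical; in particular, `operation_index` is just the bottom nibble rebased to 0..9, and no mapping from that index to a specific mnemonic is implied.

```python
ORDERINGS = {0x9: "132", 0xA: "213", 0xB: "231"}


def decode_fma3_opcode(opcode, w):
    """Decode an FMA3 opcode byte and W bit; return None if not FMA3."""
    x = (opcode >> 4) & 0xF   # top nibble: operand ordering
    y = opcode & 0xF          # bottom nibble: which of the 10 operations
    if x not in ORDERINGS or not 0x6 <= y <= 0xF:
        return None
    return {
        "ordering": ORDERINGS[x],
        "format": "FP64" if w else "FP32",
        "operation_index": y - 0x6,   # 0..9, illustrative numbering only
    }
```

Opcode bytes such as 0x95 (bottom nibble below 6) or 0xC8 (top nibble outside 9..B) are reported as not being FMA3.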
At the assembly language level, the operand ordering is specified in the mnemonic of the instruction:

* vfmadd132sd xmm1,xmm2,xmm3 will perform xmm1 ← (xmm1*xmm3)+xmm2
* vfmadd213sd xmm1,xmm2,xmm3 will perform xmm1 ← (xmm2*xmm1)+xmm3
* vfmadd231sd xmm1,xmm2,xmm3 will perform xmm1 ← (xmm2*xmm3)+xmm1

For all FMA3 variants, the first two arguments must be xmm/ymm/zmm vector register arguments, while the last argument may be either a vector register or memory argument. Under AVX-512 and AVX10, the EVEX-encoded variants support EVEX-prefix-encoded broadcast, opmasks and rounding-controls.
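The three operand orderings can be captured in a few lines. This sketch (with a hypothetical function name) models only which operands are multiplied and which is added; it does not model the single rounding step that makes the hardware operation "fused".

```python
def fma3(ordering, op1, op2, op3):
    """Return the new value of the first operand.

    `ordering` is "132", "213" or "231": the first two digits name the
    operands that are multiplied, the third digit names the addend.
    """
    if ordering not in ("132", "213", "231"):
        raise ValueError("unknown FMA3 operand ordering")
    operands = {"1": op1, "2": op2, "3": op3}
    a, b, c = (operands[digit] for digit in ordering)
    return a * b + c
```

Using small integers makes the difference between the orderings easy to see: with operands 2, 3 and 5, the three orderings give 13, 11 and 17.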
The AVX512-FP16 extension, introduced in Sapphire Rapids, adds FP16 variants of the FMA3 instructions. These all take the form EVEX.66.MAP6.W0 xy /r, with the opcode byte working in the same way as for the FP32/FP64 variants. The AVX10.2 extension, published in 2024, similarly adds BF16 variants of the packed (but not scalar) FMA3 instructions. These all take the form EVEX.NP.MAP6.W0 xy /r, with the opcode byte again working similarly to the FP32/FP64 variants. (For the FMA4 instructions, no FP16 or BF16 variants are defined.)
FMA4 encoding
FMA4 instructions are encoded with the VEX prefix, in the form VEX.66.0F3A xx /r ib (no EVEX encodings are defined). The opcode byte xx uses its bottom bit to select the floating-point format (0=FP32, 1=FP64) and the remaining bits to select one of the 10 fused-multiply-add operations to perform. For FMA4, operand ordering is controlled by the VEX.W bit. If VEX.W=0, then the third operand is the r/m operand specified by the instruction's ModR/M byte and the fourth operand is a register operand, specified by bits 7:4 of the ib (8-bit immediate) part of the instruction. If VEX.W=1, then these two operands are swapped. For example:

* vfmaddsd xmm1,xmm2,[mem],xmm3 will perform xmm1 ← (xmm2*[mem])+xmm3 and require a W=0 encoding.
* vfmaddsd xmm1,xmm2,xmm3,[mem] will perform xmm1 ← (xmm2*xmm3)+[mem] and require a W=1 encoding.
* vfmaddsd xmm1,xmm2,xmm3,xmm4 will perform xmm1 ← (xmm2*xmm3)+xmm4 and can be encoded with either W=0 or W=1.
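The role of VEX.W in FMA4 can be sketched as follows. The function names and the string labels for operand sources are hypothetical; the sketch only encodes the rule that the ModR/M r/m slot is the one operand position that can hold a memory argument.

```python
def fma4_operand_sources(w):
    """Return (source of third operand, source of fourth operand)."""
    if w == 0:
        return ("modrm.rm", "ib[7:4]")
    return ("ib[7:4]", "modrm.rm")   # W=1 swaps the two operands


def fma4_w_choices(third_is_mem, fourth_is_mem):
    """Return the list of VEX.W values that can encode the instruction."""
    if third_is_mem and fourth_is_mem:
        raise ValueError("only the r/m operand can be a memory argument")
    if third_is_mem:
        return [0]       # memory must sit in the r/m slot, which is operand 3
    if fourth_is_mem:
        return [1]       # swapping puts the r/m slot at operand 4
    return [0, 1]        # all-register form: either encoding works
```

This reproduces the three examples above: a memory third operand forces W=0, a memory fourth operand forces W=1, and the all-register form allows both.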
Opcode table
The 10 fused-multiply-add operations and the 122 instruction variants they give rise to are given by the following table – with FMA4 instructions highlighted with * and yellow cell coloring, and FMA3 instructions not highlighted:


AVX-512

AVX-512, introduced in 2014, adds 512-bit wide vector registers (extending the 256-bit registers, which become the new registers' lower halves) and doubles their count to 32; the new registers are thus named zmm0 through zmm31. It adds eight mask registers, named k0 through k7, which may be used to restrict operations to specific parts of a vector register. Unlike previous instruction set extensions, AVX-512 is implemented in several groups; only the foundation ("AVX-512F") extension is mandatory. Most of the added instructions may also be used with the 256- and 128-bit registers.


AMX

Intel AMX adds eight new tile-registers, tmm0-tmm7, each holding a matrix, with a maximum capacity of 16 rows of 64 bytes per tile-register. It also adds a TILECFG register to configure the sizes of the actual matrices held in each of the eight tile-registers, and a set of instructions to perform matrix multiplications on these registers.
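The tile capacity limits above translate into simple arithmetic. The following sketch uses hypothetical function names and treats the element size as a free parameter, which is an assumption for illustration; the text above does not list the element types that AMX supports.

```python
MAX_ROWS = 16            # maximum rows per tile-register
MAX_BYTES_PER_ROW = 64   # maximum bytes per row


def tile_bytes(rows, bytes_per_row):
    """Total storage used by a tile with the given configuration."""
    return rows * bytes_per_row


def tile_shape(rows, bytes_per_row, element_size):
    """Return (rows, columns) for a tile holding elements of `element_size` bytes."""
    if not 1 <= rows <= MAX_ROWS:
        raise ValueError("rows must be between 1 and 16")
    if not 1 <= bytes_per_row <= MAX_BYTES_PER_ROW:
        raise ValueError("bytes per row must be between 1 and 64")
    if bytes_per_row % element_size != 0:
        raise ValueError("row size must be a multiple of the element size")
    return rows, bytes_per_row // element_size
```

A fully sized tile therefore holds 16 × 64 = 1024 bytes, which would correspond to a 16×64 matrix of 1-byte elements or a 16×16 matrix of 4-byte elements.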


See also

* x86 instruction listings
* List of discontinued x86 instructions
* ARM architecture family#64/32-bit architecture




External links


* Intel Intrinsics Guide - searchable reference for Intel MMX/SSE/AVX/AVX512 SIMD intrinsics