Permute (and Shuffle) instructions, part of
bit manipulation as well as
vector processing
In computing, a vector processor or array processor is a central processing unit (CPU) that implements an instruction set where its Instruction (computer science), instructions are designed to operate efficiently and effectively on large Array d ...
, copy unaltered contents from a source array to a destination array, where the indices are specified by a second source array. The size (bitwidth) of the source elements is not restricted but remains the same as the destination size.
There exists two important permute variants, known as gather and scatter, respectively. The gather variant is as follows:
for i = 0 to length-1
dest = src ndices[i
where the scatter variant is:
for i = 0 to length-1
dest ndices[i = src[i]
Note that unlike in memory-based Gather-scatter (vector addressing), gather-scatter all three of
dest
,
src
, and
indices
are ''registers'' (or parts of registers in the case of bit-level permute), not memory locations.
The scatter variant can be seen to "scatter" the source elements ''across'' (into) to the destination, where the "gather" variant is gathering data ''from'' the indexed source elements.
Given that the indices may be repeated in both variants, the resultant output is not a strict mathematical
permutation
In mathematics, a permutation of a set can mean one of two different things:
* an arrangement of its members in a sequence or linear order, or
* the act or process of changing the linear order of an ordered set.
An example of the first mean ...
because duplicates can occur in the output.
A special case of permute is also used in GPU "
swizzling" (again, not strictly a permutation) which performs on-the-fly reordering of subvector data so as to align or duplicate elements with the appropriate
SIMD lane.
Occurrences of permute instructions
Permute instructions occur in both
scalar processors as well as
vector processing
In computing, a vector processor or array processor is a central processing unit (CPU) that implements an instruction set where its Instruction (computer science), instructions are designed to operate efficiently and effectively on large Array d ...
engines as well as
GPUs
A graphics processing unit (GPU) is a specialized electronic circuit designed for digital image processing and to accelerate computer graphics, being present either as a discrete video card or embedded on motherboards, mobile phones, personal ...
. In vector instruction sets they are typically named "Register Gather/Scatter" operations such as in
RISC-V
RISC-V (pronounced "risk-five") is an open standard instruction set architecture (ISA) based on established reduced instruction set computer (RISC) principles. The project commenced in 2010 at the University of California, Berkeley. It transfer ...
vectors, and take Vectors as input for both source elements and source array, and output another Vector.
In scalar instruction sets the scalar registers are broken down into smaller sections (unpacked,
SIMD
Single instruction, multiple data (SIMD) is a type of parallel computer, parallel processing in Flynn's taxonomy. SIMD describes computers with multiple processing elements that perform the same operation on multiple data points simultaneousl ...
style) where the fragments are then used as array sources. The (small, partial) results are then concatenated (packed) back into the scalar register as the result.
Some ISAs, particularly for
cryptographic
Cryptography, or cryptology (from "hidden, secret"; and ''graphein'', "to write", or '' -logia'', "study", respectively), is the practice and study of techniques for secure communication in the presence of adversarial behavior. More gen ...
applications, even have ''bit-level'' permute operations, such as
bdep
(bit deposit) in RISC-V bitmanip; in the
Power ISA
Power ISA is a reduced instruction set computer (RISC) instruction set architecture (ISA) currently developed by the OpenPOWER Foundation, led by IBM. It was originally developed by IBM and the now-defunct Power.org industry group. Power IS ...
it is known as and has been included for several decades, and is still in the Power ISA v.3.0 B spec.
Also in some non-vector ISAs, due to there sometimes being insufficient space in the one source input register to specify the permutation source array in full (particularly if the operation involves bit-level permutation), will include partial reordering instructions. Examples include
VSHUFF32x4
from
AVX-512
AVX-512 are 512-bit extensions to the 256-bit Advanced Vector Extensions SIMD instructions for x86 instruction set architecture (ISA) proposed by Intel in July 2013, and first implemented in the 2016 Intel Xeon Phi x200 (Knights Landing), and then ...
.
Permute operations in different forms are surprisingly common, occurring in
AltiVec,
Power ISA
Power ISA is a reduced instruction set computer (RISC) instruction set architecture (ISA) currently developed by the OpenPOWER Foundation, led by IBM. It was originally developed by IBM and the now-defunct Power.org industry group. Power IS ...
,
PowerPC G4
PowerPC G4 is a designation formerly used by Apple Inc., Apple to describe a ''fourth generation'' of 32-bit PowerPC microprocessors. Apple has applied this name to various (though closely related) processor models from Freescale Semiconductor, Fr ...
,
AVX-512
AVX-512 are 512-bit extensions to the 256-bit Advanced Vector Extensions SIMD instructions for x86 instruction set architecture (ISA) proposed by Intel in July 2013, and first implemented in the 2016 Intel Xeon Phi x200 (Knights Landing), and then ...
,
SVE2, vector processors, and
GPUs
A graphics processing unit (GPU) is a specialized electronic circuit designed for digital image processing and to accelerate computer graphics, being present either as a discrete video card or embedded on motherboards, mobile phones, personal ...
. They are sufficiently important that
LLVM
LLVM, also called LLVM Core, is a target-independent optimizer and code generator. It can be used to develop a Compiler#Front end, frontend for any programming language and a Compiler#Back end, backend for any instruction set architecture. LLVM i ...
added the
intrinsic and
GCC added the intrinsic. GCC's intrinsic matches the functionality of
OpenCL
OpenCL (Open Computing Language) is a software framework, framework for writing programs that execute across heterogeneous computing, heterogeneous platforms consisting of central processing units (CPUs), graphics processing units (GPUs), di ...
's shuffle intrinsics.
Note that all of these, mathematically, are not permutations because duplicates can occur in the output.
See also
*
*
References
External links
Unofficial convenient list of AVX-512 shuffle instructionsOfficial Intel AVX-512 ISA ManualARM NEON tutorial on data rearrangement
{{Multimedia extensions
Binary arithmetic
Computer arithmetic