Permute (and Shuffle) instructions, part of
bit manipulation
Bit manipulation is the act of algorithmically manipulating bits or other pieces of data shorter than a word. Computer programming tasks that require bit manipulation include low-level device control, error detection and correction algori ...
as well as
vector processing
In computing, a vector processor or array processor is a central processing unit (CPU) that implements an instruction set where its instructions are designed to operate efficiently and effectively on large one-dimensional arrays of data called ...
, copy unaltered contents from a source array to a destination array, where the indices are specified by a second source array. The size (bitwidth) of the source elements is not restricted but remains the same as the destination size.
There exists two important permute variants, known as gather and scatter, respectively. The gather variant is as follows:
for i = 0 to length-1
dest = src ndices[i
where the scatter variant is:
for i = 0 to length-1
dest ndices[i = src[i]
Note that unlike in memory-based Gather-scatter (vector addressing), gather-scatter all three of
dest
,
src
, and
indices
are ''registers'' (or parts of registers in the case of bit-level permute), not memory locations.
The scatter variant can be seen to "scatter" the source elements ''across'' (into) to the destination, where the "gather" variant is gathering data ''from'' the indexed source elements.
Given that the indices may be repeated in both variants, the resultant output is not a strict mathematical
permutation
In mathematics, a permutation of a set is, loosely speaking, an arrangement of its members into a sequence or linear order, or if the set is already ordered, a rearrangement of its elements. The word "permutation" also refers to the act or p ...
because duplicates can occur in the output.
A special case of permute is also used in GPU "
swizzling" (again, not strictly a permutation) which performs on-the-fly reordering of subvector data so as to align or duplicate elements with the appropriate
SIMD lane Single instruction, multiple threads (SIMT) is an execution model used in parallel computing where single instruction, multiple data (SIMD) is combined with multithreading. It is different from SPMD in that all instructions in all "threads" are ex ...
.
Occurrences of permute instructions
Permute instructions occur in both
scalar processor
Scalar processors are a class of computer processors that process only one data item at a time. Typical data items include integers and floating point numbers.
Classification
A scalar processor is classified as a single instruction, single da ...
s as well as
vector processing
In computing, a vector processor or array processor is a central processing unit (CPU) that implements an instruction set where its instructions are designed to operate efficiently and effectively on large one-dimensional arrays of data called ...
engines as well as
GPUs
A graphics processing unit (GPU) is a specialized electronic circuit designed to manipulate and alter memory to accelerate the creation of images in a frame buffer intended for output to a display device. GPUs are used in embedded systems, mob ...
. In vector instruction sets they are typically named "Register Gather/Scatter" operations such as in
RISC-V
RISC-V (pronounced "risk-five" where five refers to the number of generations of RISC architecture that were developed at the University of California, Berkeley since 1981) is an open standard instruction set architecture (ISA) based on establi ...
vectors, and take Vectors as input for both source elements and source array, and output another Vector.
In scalar instruction sets the scalar registers are broken down into smaller sections (unpacked,
SIMD
Single instruction, multiple data (SIMD) is a type of parallel processing in Flynn's taxonomy. SIMD can be internal (part of the hardware design) and it can be directly accessible through an instruction set architecture (ISA), but it should ...
style) where the fragments are then used as array sources. The (small, partial) results are then concatenated (packed) back into the scalar register as the result.
Some ISAs, particularly for
cryptographic
Cryptography, or cryptology (from grc, , translit=kryptós "hidden, secret"; and ''graphein'', "to write", or ''-logia'', "study", respectively), is the practice and study of techniques for secure communication in the presence of adver ...
applications, even have ''bit-level'' permute operations, such as
bdep
(bit deposit) in RISC-V bitmanip; in the
Power ISA
Power ISA is a reduced instruction set computer (RISC) instruction set architecture (ISA) currently developed by the OpenPOWER Foundation, led by IBM. It was originally developed by IBM and the now-defunct Power.org industry group. Power ISA ...
it is known as and has been included for several decades, and is still in the Power ISA v.3.0 B spec.
Also in some non-vector ISAs, due to there sometimes being insufficient space in the one source input register to specify the permutation source array in full (particularly if the operation involves bit-level permutation), will include partial reordering instructions. Examples include
VSHUFF32x4
from
AVX-512 AVX-512 are 512-bit extensions to the 256-bit Advanced Vector Extensions SIMD instructions for x86 instruction set architecture (ISA) proposed by Intel in July 2013, and implemented in Intel's Xeon Phi x200 (Knights Landing) and Skylake-X CPUs; ...
.
Permute operations in different forms are surprisingly common, occurring in
AltiVec,
Power ISA
Power ISA is a reduced instruction set computer (RISC) instruction set architecture (ISA) currently developed by the OpenPOWER Foundation, led by IBM. It was originally developed by IBM and the now-defunct Power.org industry group. Power ISA ...
,
PowerPC G4
PowerPC G4 is a designation formerly used by Apple and Eyetech to describe a ''fourth generation'' of 32-bit PowerPC microprocessors. Apple has applied this name to various (though closely related) processor models from Freescale, a former part ...
,
AVX-512 AVX-512 are 512-bit extensions to the 256-bit Advanced Vector Extensions SIMD instructions for x86 instruction set architecture (ISA) proposed by Intel in July 2013, and implemented in Intel's Xeon Phi x200 (Knights Landing) and Skylake-X CPUs; ...
,
SVE2, vector processors, and
GPUs
A graphics processing unit (GPU) is a specialized electronic circuit designed to manipulate and alter memory to accelerate the creation of images in a frame buffer intended for output to a display device. GPUs are used in embedded systems, mob ...
. They are sufficiently important that
LLVM
LLVM is a set of compiler and toolchain technologies that can be used to develop a front end for any programming language and a back end for any instruction set architecture. LLVM is designed around a language-independent intermediate repre ...
added the
intrinsic and
GCC added the intrinsic. GCC's intrinsic matches the functionality of
OpenCL
OpenCL (Open Computing Language) is a framework for writing programs that execute across heterogeneous platforms consisting of central processing units (CPUs), graphics processing units (GPUs), digital signal processors (DSPs), field-prog ...
's shuffle intrinsics.
Note that all of these, mathematically, are not permutations because duplicates can occur in the output.
See also
*
*
References
External links
Unofficial convenient list of AVX-512 shuffle instructionsOfficial Intel AVX-512 ISA ManualARM NEON tutorial on data rearrangement
{{Multimedia extensions
Binary arithmetic
Computer arithmetic