Permute (and Shuffle) instructions, part of bit manipulation as well as

vector processing In computing, a vector processor or array processor is a central processing unit (CPU) that implements an instruction set where its Instruction (computer science), instructions are designed to operate efficiently and effectively on large Array d ...

, copy unaltered contents from a source array to a destination array, where the indices are specified by a second source array. The size (bitwidth) of the source elements is not restricted but remains the same as the destination size. There exists two important permute variants, known as gather and scatter, respectively. The gather variant is as follows: for i = 0 to length-1 dest = src ndices[i where the scatter variant is: for i = 0 to length-1 dest ndices[i = src[i] Note that unlike in memory-based Gather-scatter (vector addressing), gather-scatter all three of dest, src, and indices are ''registers'' (or parts of registers in the case of bit-level permute), not memory locations. The scatter variant can be seen to "scatter" the source elements ''across'' (into) to the destination, where the "gather" variant is gathering data ''from'' the indexed source elements. Given that the indices may be repeated in both variants, the resultant output is not a strict mathematical

permutation In mathematics, a permutation of a set can mean one of two different things: * an arrangement of its members in a sequence or linear order, or * the act or process of changing the linear order of an ordered set. An example of the first mean ...

because duplicates can occur in the output. A special case of permute is also used in GPU " swizzling" (again, not strictly a permutation) which performs on-the-fly reordering of subvector data so as to align or duplicate elements with the appropriate SIMD lane.

Occurrences of permute instructions

Permute instructions occur in both scalar processors as well as

engines as well as

GPUs A graphics processing unit (GPU) is a specialized electronic circuit designed for digital image processing and to accelerate computer graphics, being present either as a discrete video card or embedded on motherboards, mobile phones, personal ...

. In vector instruction sets they are typically named "Register Gather/Scatter" operations such as in

RISC-V RISC-V (pronounced "risk-five") is an open standard instruction set architecture (ISA) based on established reduced instruction set computer (RISC) principles. The project commenced in 2010 at the University of California, Berkeley. It transfer ...

vectors, and take Vectors as input for both source elements and source array, and output another Vector. In scalar instruction sets the scalar registers are broken down into smaller sections (unpacked,

SIMD Single instruction, multiple data (SIMD) is a type of parallel computer, parallel processing in Flynn's taxonomy. SIMD describes computers with multiple processing elements that perform the same operation on multiple data points simultaneousl ...

style) where the fragments are then used as array sources. The (small, partial) results are then concatenated (packed) back into the scalar register as the result. Some ISAs, particularly for

cryptographic Cryptography, or cryptology (from "hidden, secret"; and ''graphein'', "to write", or '' -logia'', "study", respectively), is the practice and study of techniques for secure communication in the presence of adversarial behavior. More gen ...

applications, even have ''bit-level'' permute operations, such as bdep (bit deposit) in RISC-V bitmanip; in the

Power ISA Power ISA is a reduced instruction set computer (RISC) instruction set architecture (ISA) currently developed by the OpenPOWER Foundation, led by IBM. It was originally developed by IBM and the now-defunct Power.org industry group. Power IS ...

it is known as and has been included for several decades, and is still in the Power ISA v.3.0 B spec. Also in some non-vector ISAs, due to there sometimes being insufficient space in the one source input register to specify the permutation source array in full (particularly if the operation involves bit-level permutation), will include partial reordering instructions. Examples include VSHUFF32x4 from

AVX-512 AVX-512 are 512-bit extensions to the 256-bit Advanced Vector Extensions SIMD instructions for x86 instruction set architecture (ISA) proposed by Intel in July 2013, and first implemented in the 2016 Intel Xeon Phi x200 (Knights Landing), and then ...

. Permute operations in different forms are surprisingly common, occurring in AltiVec,

PowerPC G4 PowerPC G4 is a designation formerly used by Apple Inc., Apple to describe a ''fourth generation'' of 32-bit PowerPC microprocessors. Apple has applied this name to various (though closely related) processor models from Freescale Semiconductor, Fr ...

, SVE2, vector processors, and

. They are sufficiently important that

LLVM LLVM, also called LLVM Core, is a target-independent optimizer and code generator. It can be used to develop a Compiler#Front end, frontend for any programming language and a Compiler#Back end, backend for any instruction set architecture. LLVM i ...

added the intrinsic and GCC added the intrinsic. GCC's intrinsic matches the functionality of

OpenCL OpenCL (Open Computing Language) is a software framework, framework for writing programs that execute across heterogeneous computing, heterogeneous platforms consisting of central processing units (CPUs), graphics processing units (GPUs), di ...

's shuffle intrinsics. Note that all of these, mathematically, are not permutations because duplicates can occur in the output.

References

External links

Unofficial convenient list of AVX-512 shuffle instructions

Official Intel AVX-512 ISA Manual

ARM NEON tutorial on data rearrangement
{{Multimedia extensions Binary arithmetic Computer arithmetic

Occurrences of permute instructions

See also

References

External links