SIMD





Single instruction, multiple data (SIMD) is a class of parallel computers in Flynn's taxonomy. It describes computers with multiple processing elements that perform the same operation on multiple data points simultaneously. Such machines exploit data-level parallelism, but not concurrency: there are simultaneous (parallel) computations, but only a single process (instruction) at a given moment. SIMD is particularly applicable to common tasks such as adjusting the contrast in a digital image or adjusting the volume of digital audio. Most modern CPU designs include SIMD instructions to improve the performance of multimedia use.

Contents

1 History
2 Advantages
3 Disadvantages
4 Chronology
5 Hardware
6 Software
7 SIMD on the web
8 Commercial applications
9 See also
10 References
11 External links

History

The first use of SIMD instructions was in the vector supercomputers of the early 1970s such as the CDC Star-100 and the Texas Instruments ASC, which could operate on a "vector" of data with a single instruction. Vector processing was especially popularized by Cray in the 1970s and 1980s. Vector-processing architectures are now considered separate from SIMD machines, because vector machines processed the vectors one word at a time through pipelined processors (though still based on a single instruction), whereas modern SIMD machines process all elements of the vector simultaneously.[1]

The first era of modern SIMD machines was characterized by massively parallel processing-style supercomputers such as the Thinking Machines CM-1 and CM-2. These machines had many limited-functionality processors that would work in parallel. For example, each of the 65,536 single-bit processors in a Thinking Machines CM-2 would execute the same instruction at the same time, allowing it, for instance, to logically combine 65,536 pairs of bits at a time, using a hypercube-connected network or processor-dedicated RAM to find its operands. Supercomputing moved away from the SIMD approach when inexpensive scalar MIMD approaches based on commodity processors such as the Intel i860 XP[2] became more powerful, and interest in SIMD waned.

The current era of SIMD processors grew out of the desktop-computer market rather than the supercomputer market. As desktop processors became powerful enough to support real-time gaming and audio/video processing during the 1990s, demand grew for this particular type of computing power, and microprocessor vendors turned to SIMD to meet it.[3] Sun Microsystems introduced SIMD integer instructions in its "VIS" instruction set extensions in 1995, in its UltraSPARC I microprocessor. MIPS followed suit with their similar MDMX system. The first widely deployed desktop SIMD was Intel's MMX extensions to the x86 architecture in 1996. This sparked the introduction of the much more powerful AltiVec system in Motorola's PowerPC and IBM's POWER systems. Intel responded in 1999 by introducing the all-new SSE system. Since then, there have been several extensions to the SIMD instruction sets for both architectures. All of these developments have been oriented toward support for real-time graphics, and are therefore oriented toward processing in two, three, or four dimensions, usually with vector lengths of between two and sixteen words, depending on data type and architecture. When new SIMD architectures need to be distinguished from older ones, the newer architectures are considered "short-vector" architectures, as earlier SIMD and vector supercomputers had vector lengths from 64 to 64,000. A modern supercomputer is almost always a cluster of MIMD machines, each of which implements (short-vector) SIMD instructions. A modern desktop computer is often a multiprocessor MIMD machine where each processor can execute short-vector SIMD instructions.

Advantages

An application that may take advantage of SIMD is one where the same value is being added to (or subtracted from) a large number of data points, a common operation in many multimedia applications. One example would be changing the brightness of an image. Each pixel of an image consists of three values for the brightness of the red (R), green (G) and blue (B) portions of the color. To change the brightness, the R, G and B values are read from memory, a value is added to (or subtracted from) them, and the resulting values are written back out to memory.

With a SIMD processor there are two improvements to this process. First, the data is understood to be in blocks, and a number of values can be loaded all at once. Instead of a series of instructions saying "retrieve this pixel, now retrieve the next pixel", a SIMD processor will have a single instruction that effectively says "retrieve n pixels" (where n is a number that varies from design to design). For a variety of reasons, this can take much less time than retrieving each pixel individually, as with a traditional CPU design. Second, the instruction operates on all loaded data in a single operation. In other words, if the SIMD system works by loading up eight data points at once, the add operation being applied to the data will happen to all eight values at the same time. This parallelism is separate from the parallelism provided by a superscalar processor; the eight values are processed in parallel even on a non-superscalar processor, and a superscalar processor may be able to perform multiple SIMD operations in parallel. A sketch of the brightness example follows.
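
The sketch below, in C with SSE2 intrinsics, is one minimal way to express the brightness example; the function name brighten and its parameters are illustrative, not taken from any particular library. A single saturating-add instruction brightens 16 packed 8-bit channel values at a time, with a scalar loop handling any leftover bytes.

    /* Minimal sketch: add a constant to every 8-bit color channel, 16 channels at a time. */
    #include <emmintrin.h>   /* SSE2 intrinsics */
    #include <stdint.h>
    #include <stddef.h>

    void brighten(uint8_t *channels, size_t n, uint8_t delta)
    {
        __m128i vdelta = _mm_set1_epi8((char)delta);
        size_t i = 0;
        for (; i + 16 <= n; i += 16) {
            __m128i v = _mm_loadu_si128((const __m128i *)(channels + i)); /* load 16 bytes */
            v = _mm_adds_epu8(v, vdelta);            /* saturating add: values clamp at 255 */
            _mm_storeu_si128((__m128i *)(channels + i), v);
        }
        for (; i < n; ++i) {                         /* scalar tail for the remainder */
            unsigned s = channels[i] + delta;
            channels[i] = (uint8_t)(s > 255 ? 255 : s);
        }
    }
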
Disadvantages

- Not all algorithms can be vectorized easily. For example, a flow-control-heavy task like code parsing may not easily benefit from SIMD; however, it is theoretically possible to vectorize comparisons and "batch flow" to target maximal cache optimality, though this technique requires more intermediate state. (Note: batch-pipeline systems, such as GPUs or software rasterization pipelines, are most advantageous for cache control when implemented with SIMD intrinsics, but they are not exclusive to SIMD features.)
- Further complexity arises in avoiding dependences within series of data such as code strings, since independence is required for vectorization.
- Large register files increase power consumption and require chip area.
- Currently, implementing an algorithm with SIMD instructions usually requires human labor; most compilers do not generate SIMD instructions from a typical C program, for instance. Automatic vectorization in compilers is an active area of computer science research. (Compare vector processing.)
- Programming with particular SIMD instruction sets can involve numerous low-level challenges:

  - SIMD may have restrictions on data alignment; programmers familiar with one particular architecture may not expect this (see the sketch after this list).
  - Gathering data into SIMD registers and scattering it to the correct destination locations is tricky (sometimes requiring permute operations) and can be inefficient.
  - Specific instructions like rotations or three-operand addition are not available in some SIMD instruction sets.
  - Instruction sets are architecture-specific: some processors lack SIMD instructions entirely, so programmers must provide non-vectorized implementations (or different vectorized implementations) for them.
  - The early MMX instruction set shared a register file with the floating-point stack, which caused inefficiencies when mixing floating-point and MMX code. However, SSE2 corrects this.
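
As a small illustration of the alignment restriction, the hedged C sketch below contrasts SSE's aligned and unaligned loads; the buffer and code are purely illustrative. _mm_load_ps requires a 16-byte-aligned address and may fault on a misaligned one, while _mm_loadu_ps accepts any address (historically at some cost in speed).

    /* Sketch: aligned vs. unaligned SSE loads. */
    #include <xmmintrin.h>   /* SSE intrinsics */
    #include <stdlib.h>

    int main(void)
    {
        /* 16-byte-aligned buffer (C11 aligned_alloc); the size is a multiple of the alignment. */
        float *a = aligned_alloc(16, 8 * sizeof(float));
        for (int i = 0; i < 8; ++i) a[i] = (float)i;

        __m128 v0 = _mm_load_ps(a);        /* fine: a is 16-byte aligned */
        __m128 v1 = _mm_loadu_ps(a + 1);   /* a + 1 is misaligned, so an unaligned load is needed */
        __m128 s  = _mm_add_ps(v0, v1);
        (void)s;

        free(a);
        return 0;
    }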

Chronology

Examples of SIMD supercomputers (not including vector processors):

- ILLIAC IV, c. 1974
- ICL Distributed Array Processor (DAP), c. 1974
- Burroughs Scientific Processor, c. 1976
- Geometric-Arithmetic Parallel Processor, from Martin Marietta, starting in 1981, continued at Lockheed Martin, then at Teranex and Silicon Optix
- Massively Parallel Processor (MPP), from NASA/Goddard Space Flight Center, c. 1983-1991
- Connection Machine, models 1 and 2 (CM-1 and CM-2), from Thinking Machines Corporation, c. 1985
- MasPar MP-1 and MP-2, c. 1987-1996
- Zephyr DC computer from Wavetracer, c. 1991
- Xplor, from Pyxsys, Inc., c. 2001

Hardware

Small-scale (64 or 128 bits) SIMD became popular on general-purpose CPUs in the early 1990s and continued through 1997 and later with Motion Video Instructions (MVI) for Alpha. SIMD instructions can be found, to one degree or another, on most CPUs, including IBM's AltiVec and SPE for PowerPC, HP's PA-RISC Multimedia Acceleration eXtensions (MAX), Intel's MMX and iwMMXt, SSE, SSE2, SSE3, SSSE3 and SSE4.x, AMD's 3DNow!, ARC's ARC Video subsystem, SPARC's VIS and VIS2, Sun's MAJC, ARM's NEON technology, MIPS' MDMX (MaDMaX) and MIPS-3D. The Cell processor, co-developed by IBM, Sony, and Toshiba, has an SPU instruction set that is heavily SIMD-based. NXP, founded by Philips, developed several SIMD processors named Xetal. The Xetal has 320 16-bit processing elements especially designed for vision tasks.

Modern graphics processing units (GPUs) are often wide SIMD implementations, capable of branches, loads, and stores on 128 or 256 bits at a time. Intel's AVX SIMD instructions now process 256 bits of data at once. Intel's Larrabee prototype microarchitecture includes more than two 512-bit SIMD registers on each of its cores (VPU: Wide Vector Processing Units), and this 512-bit SIMD capability is being continued in Intel's Many Integrated Core Architecture (Intel MIC) and Skylake-X.
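
To give a concrete feel for the 256-bit AVX operations mentioned above, the hedged C sketch below adds two float arrays eight elements per instruction; it assumes an AVX-capable CPU and suitable compiler flags (for example -mavx), and the function name is illustrative.

    /* Sketch: element-wise addition of two arrays, eight floats (256 bits) per AVX instruction. */
    #include <immintrin.h>   /* AVX intrinsics */

    void add_arrays_avx(const float *a, const float *b, float *out, int n)
    {
        int i = 0;
        for (; i + 8 <= n; i += 8) {
            __m256 va = _mm256_loadu_ps(a + i);
            __m256 vb = _mm256_loadu_ps(b + i);
            _mm256_storeu_ps(out + i, _mm256_add_ps(va, vb));
        }
        for (; i < n; ++i)                 /* scalar remainder */
            out[i] = a[i] + b[i];
    }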

Software

The ordinary tripling of four 8-bit numbers: the CPU loads one 8-bit number into R1, multiplies it with R2, and then saves the answer from R3 back to RAM. This process is repeated for each number.

The SIMD tripling of four 8-bit numbers: the CPU loads the four numbers at once, multiplies them all in one SIMD multiplication, and saves them all at once back to RAM. In theory, the operation is about four times as fast (the time taken drops by roughly 75%).
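
The hedged sketch below expresses that comparison in C using the GCC/Clang vector-extension syntax rather than any specific CPU's instruction set; the type u8x4 and both function names are made up for the illustration. The scalar version performs one multiply per element, while the vector version multiplies all four 8-bit lanes in a single element-wise operation (each lane wraps modulo 256, as in ordinary 8-bit arithmetic).

    /* Sketch: scalar vs. SIMD-style tripling of four 8-bit numbers. */
    #include <stdint.h>

    typedef uint8_t u8x4 __attribute__((vector_size(4)));   /* four 8-bit lanes in one small vector */

    void triple_scalar(uint8_t x[4])
    {
        for (int i = 0; i < 4; ++i)
            x[i] = (uint8_t)(x[i] * 3);    /* one multiply per element */
    }

    void triple_simd(u8x4 *x)
    {
        const u8x4 three = {3, 3, 3, 3};
        *x = *x * three;                   /* one element-wise multiply covers all four lanes */
    }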

SIMD instructions are widely used to process 3D graphics, although modern graphics cards with embedded SIMD have largely taken over this task from the CPU. Some systems also include permute functions that re-pack elements inside vectors, making them particularly useful for data processing and compression (a brief sketch appears below). They are also used in cryptography.[4][5][6] The trend of general-purpose computing on GPUs (GPGPU) may lead to wider use of SIMD in the future.
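
As a brief, hedged illustration of such a permute, the C snippet below uses the SSE2 _mm_shuffle_epi32 intrinsic, which reorders the four 32-bit elements of a register according to a compile-time pattern; the function name reverse_lanes is illustrative only.

    /* Sketch: re-packing (permuting) the four 32-bit lanes of an SSE register. */
    #include <emmintrin.h>   /* SSE2 intrinsics */

    __m128i reverse_lanes(__m128i v)
    {
        /* _MM_SHUFFLE(3,2,1,0) would be the identity; (0,1,2,3) reverses the lane order. */
        return _mm_shuffle_epi32(v, _MM_SHUFFLE(0, 1, 2, 3));
    }
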
Adoption of SIMD systems in personal computer software was at first slow, due to a number of problems. One was that many of the early SIMD instruction sets tended to slow overall performance of the system due to the re-use of existing floating-point registers. Other systems, like MMX and 3DNow!, offered support for data types that were not interesting to a wide audience and had expensive context-switching instructions to switch between using the FPU and MMX registers. Compilers also often lacked support, requiring programmers to resort to assembly language coding.

SIMD on x86 had a slow start. The introduction of 3DNow! by AMD and SSE by Intel confused matters somewhat, but today the system seems to have settled down (after AMD adopted SSE) and newer compilers should result in more SIMD-enabled software. Intel and AMD now both provide optimized math libraries that use SIMD instructions, and open-source alternatives like libSIMD, SIMDx86 and SLEEF have started to appear.

Apple Computer had somewhat more success, even though it entered the SIMD market later than the rest. AltiVec offered a rich system and can be programmed using increasingly sophisticated compilers from Motorola, IBM and GNU, so assembly language programming is rarely needed. Additionally, many of the systems that would benefit from SIMD were supplied by Apple itself, for example iTunes and QuickTime. However, in 2006, Apple computers moved to Intel x86 processors. Apple's APIs and development tools (Xcode) were modified to support SSE2 and SSE3 as well as AltiVec. Apple was the dominant purchaser of PowerPC chips from IBM and Freescale Semiconductor; even though it abandoned the platform, further development of AltiVec is continued in several Power Architecture designs from Freescale and IBM. At WWDC '15, Apple announced SIMD vector support for version 2.0 of its new programming language Swift.

SIMD within a register, or SWAR, is a range of techniques and tricks used for performing SIMD in general-purpose registers on hardware that doesn't provide any direct support for SIMD instructions. This can be used to exploit parallelism in certain algorithms even on hardware that does not support SIMD directly.
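
A minimal SWAR sketch in plain C, assuming nothing beyond ordinary 32-bit integer arithmetic: four packed 8-bit lanes held in a single uint32_t are added lane-by-lane, with masking used to keep carries from crossing lane boundaries. The function name is illustrative.

    /* Sketch: add four packed bytes at once inside a general-purpose 32-bit register. */
    #include <stdint.h>

    static uint32_t swar_add_u8x4(uint32_t x, uint32_t y)
    {
        const uint32_t H = 0x80808080u;            /* the top bit of each byte */
        uint32_t low = (x & ~H) + (y & ~H);        /* add the low 7 bits of every lane; no carry crosses a byte */
        return low ^ ((x ^ y) & H);                /* fold the top bits back in without inter-lane carry-out */
    }

    /* Example: swar_add_u8x4(0x01020304, 0x10203040) yields 0x11223344
       (each byte added independently, wrapping modulo 256). */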

Microsoft added SIMD to .NET in RyuJIT.[7] Libraries that implement SIMD on .NET are available in the NuGet package Microsoft.Bcl.Simd.[8]

SIMD on the web

In 2013 John McCutchan announced[9] that he had created a high-performance interface to SIMD instruction sets for the Dart programming language, bringing the benefits of SIMD to web programs for the first time. The interface consists of two types:

- Float32x4: four single-precision floating-point values.
- Int32x4: four 32-bit integer values.

Instances of these types are immutable, and in optimized code they are mapped directly to SIMD registers. Operations expressed in Dart are typically compiled into a single instruction with no overhead. This is similar to C and C++ intrinsics. Benchmarks for 4×4 matrix multiplication, 3D vertex transformation, and Mandelbrot set visualization show a near 400% speedup compared to scalar code written in Dart. McCutchan's work on Dart has been adopted by ECMAScript, and Intel announced at IDF 2013 that they are implementing McCutchan's specification for both V8 and SpiderMonkey. Emscripten, Mozilla's C/C++-to-JavaScript compiler, can with extensions[10] compile C++ programs that use SIMD intrinsics or GCC-style vector code to the SIMD API of JavaScript, resulting in equivalent speedups compared to scalar code.

Commercial applications

Though it has generally proven difficult to find sustainable commercial applications for SIMD-only processors, one that has had some measure of success is the GAPP, which was developed by Lockheed Martin and taken to the commercial sector by their spin-off Teranex. The GAPP's recent incarnations have become a powerful tool in real-time video processing applications like conversion between various video standards and frame rates (NTSC to/from PAL, NTSC to/from HDTV formats, etc.), deinterlacing, image noise reduction, adaptive video compression, and image enhancement.

A more ubiquitous application for SIMD is found in video games: nearly every modern video game console since 1998 has incorporated a SIMD processor somewhere in its architecture. The PlayStation 2 was unusual in that one of its vector-float units could function as an autonomous DSP executing its own instruction stream, or as a coprocessor driven by ordinary CPU instructions. 3D graphics applications tend to lend themselves well to SIMD processing as they rely heavily on operations with 4-dimensional vectors. Microsoft's Direct3D 9.0 now chooses at runtime processor-specific implementations of its own math operations, including the use of SIMD-capable instructions.

One of the recent processors to use vector processing is the Cell processor developed by IBM in cooperation with Toshiba and Sony. It uses a number of SIMD processors (a NUMA architecture, each with independent local store and controlled by a general-purpose CPU) and is geared towards the huge datasets required by 3D and video processing applications. It differs from traditional ISAs by being SIMD from the ground up with no separate scalar registers. A more recent development by ZiiLabs is a SIMD-type processor that can be used on mobile devices, such as media players and mobile phones.[11]

Larger-scale commercial SIMD processors are available from ClearSpeed Technology, Ltd. and Stream Processors, Inc. ClearSpeed's CSX600 (2004) has 96 cores, each with two double-precision floating-point units, while the CSX700 (2008) has 192. Stream Processors is headed by computer architect Bill Dally. Their Storm-1 processor (2007) contains 80 SIMD cores controlled by a MIPS CPU.

See also

- SIMD within a register (SWAR)
- Single Program, Multiple Data (SPMD)
- OpenCL

References

1. David A. Patterson and John L. Hennessy, Computer Organization and Design: the Hardware/Software Interface, 2nd Edition, Morgan Kaufmann Publishers, Inc., San Francisco, California, 1998, p. 751.
2. "MIMD1 - XP/S, CM-5" (PDF).
3. Conte, G.; Tommesani, S.; Zanichelli, F. (2000). The long and winding road to high-performance image processing with MMX/SSE (PDF). Proc. IEEE Int'l Workshop on Computer Architectures for Machine Perception.
4. RE: SSE2 speed, showing how SSE2 is used to implement SHA hash algorithms.
5. Salsa20 speed; Salsa20 software, showing a stream cipher implemented using SSE2.
6. Subject: up to 1.4x RSA throughput using SSE2, showing RSA implemented using a non-SIMD SSE2 integer multiply instruction.
7. "RyuJIT: The next-generation JIT compiler for .NET".
8. "The JIT finally proposed. JIT and SIMD are getting married".
9. https://www.dartlang.org/slides/2013/02/Bringing-SIMD-to-the-Web-via-Dart.pdf
10. https://docs.google.com/viewer?a=v&pid=sites&srcid=ZGVmYXVsdGRvbWFpbnx3cG12cDIwMTV8Z3g6NTkzYWE2OGNlNDAyMTRjOQ
11. ZiiLabs Corporate Website https://secure.ziilabs.com/products/processors/zms05.aspx

External links

- SIMD architectures (2000)
- Cracking Open The Pentium 3 (1999)
- Short Vector Extensions in Commercial Microprocessor
- Article about Optimizing the Rendering Pipeline of Animated Models Using the Intel Streaming SIMD Extensions
- "Yeppp!": cross-platform, open-source SIMD library from Georgia Tech
- Introduction to Parallel Computing from LLNL Lawrence Livermore National Laboratory
