Single instruction, multiple threads (SIMT) is an execution model used in parallel computing where single instruction, multiple data (SIMD) is combined with zero-overhead multithreading, i.e. multithreading where the hardware is capable of switching between threads on a cycle-by-cycle basis. Two models of multithreading are involved. In addition to the zero-overhead multithreading just mentioned, the SIMD execution hardware is virtualized to represent a multiprocessor. This virtual multiprocessor is inferior to an SPMD (single program, multiple data) processor, because the instructions of all "threads" are executed in lock-step in the lanes of the SIMD processor, which can only execute the same instruction in a given cycle across all lanes.

The SIMT execution model has been implemented on several GPUs and is relevant for general-purpose computing on graphics processing units (GPGPU); for example, some supercomputers combine CPUs with GPUs.

The processors, say a number p of them, seem to execute many more than p tasks. This is achieved by each processor having multiple "threads" (or "work-items" or "Sequence of SIMD Lane operations"), which execute in lock-step and are analogous to SIMD lanes.

The simplest way to understand SIMT is to imagine a multi-core system in which each core has its own register file, its own ALUs (both SIMD and scalar) and its own data cache. A standard multi-core system has multiple independent instruction caches and decoders, as well as multiple independent program counter registers. In SIMT, by contrast, the instructions are synchronously broadcast to all SIMT cores from a single unit with a single instruction cache and a single instruction decoder, which reads instructions using a single program counter.

The key difference between SIMT and SIMD lanes is that each of the SIMT cores may have a completely different stack pointer (and thus perform computations on completely different data sets), whereas SIMD lanes are simply part of an ALU that knows nothing about memory per se.

However, the SIMT execution model is still only a way to present to the programmer what is fundamentally a SIMD core, and programs must be designed with the SIMD architecture in mind. SIMT may allow threads to diverge by branching, but this should be avoided where possible. A divergent branch results in the equivalent of executing multiple SIMD instructions in which certain SIMD lanes are masked out and remain idle, which wastes execution capacity. In other words, the multithreading aspect of SIMT is only a way to organize the flow of computation; it is not a feature that the programmer should attempt to exploit to its full extent.

It is also important to note the difference between SIMT and SPMD: SPMD, like standard multi-core systems, has multiple program counters.
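The cost of a divergent branch under a single program counter can be illustrated with a small software model. The following Python sketch is not how any particular GPU is implemented; the function name `run_if_else`, the constant `WARP_SIZE` and the example operations are assumptions made purely for illustration. It executes an IF-ELSE for one group of lanes in lock-step, issuing each path once for the whole group and using a per-lane mask to decide which lanes keep the result.

```python
WARP_SIZE = 32  # assumed group width for this sketch; real widths are hardware-specific


def run_if_else(data, predicate, then_op, else_op):
    """Model `x = then_op(x) if predicate(x) else else_op(x)` for lanes in lock-step.

    Returns the per-lane results and the number of passes that had to be issued.
    """
    mask = [predicate(x) for x in data]  # one predicate bit per lane
    result = list(data)
    passes = 0

    if any(mask):  # THEN path is issued if at least one lane takes it
        passes += 1
        result = [then_op(x) if m else r for x, m, r in zip(data, mask, result)]

    if not all(mask):  # ELSE path is issued if at least one lane takes it
        passes += 1
        result = [else_op(x) if not m else r for x, m, r in zip(data, mask, result)]

    return result, passes


def is_even(x):
    return x % 2 == 0


def double(x):
    return x * 2


def add_hundred(x):
    return x + 100


# Divergent input: even and odd values are mixed, so both paths are issued.
divergent = list(range(WARP_SIZE))
# Coherent input: every lane takes the THEN path, so only one path is issued.
coherent = [2 * i for i in range(WARP_SIZE)]

div_result, div_passes = run_if_else(divergent, is_even, double, add_hundred)
coh_result, coh_passes = run_if_else(coherent, is_even, double, add_hundred)

print(div_passes, coh_passes)  # prints "2 1"
```

In this model the divergent group needs two passes, each with half of its lanes idle, while the coherent group finishes in one. This is the utilization loss that the text advises programmers to avoid.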

History

SIMT was introduced by NVIDIA in the Tesla GPU microarchitecture with the G80 chip. ATI Technologies, now AMD, released a competing product slightly later on May 14, 2007, the TeraScale 1-based "R600" GPU chip.

Description

As the access time of all widespread RAM types (e.g. DDR SDRAM, GDDR SDRAM, XDR DRAM, etc.) is still relatively high, engineers came up with the idea of hiding the latency that inevitably comes with each memory access. Strictly speaking, latency hiding is a feature of the zero-overhead scheduling implemented by modern GPUs, and it may or may not be considered a property of SIMT itself.

SIMT is intended to limit instruction fetching overhead, i.e. the latency that comes with memory access, and is used in modern GPUs (such as those of NVIDIA and AMD) in combination with latency hiding to enable high-performance execution despite considerable latency in memory-access operations. With latency hiding, the processor is oversubscribed with computation tasks and is able to switch quickly between tasks when it would otherwise have to wait on memory. This strategy is comparable to hyperthreading in CPUs.

As with SIMD, another major benefit is the sharing of the control logic by many data lanes, leading to an increase in computational density. One block of control logic can manage N data lanes, instead of replicating the control logic N times.

A downside of SIMT execution is that thread-specific control flow is performed using "masking", leading to poor utilization where a processor's threads follow different control-flow paths. For instance, to handle an IF-ELSE block where various threads of a processor execute different paths, all threads must actually process both paths (as all threads of a processor always execute in lock-step), but masking is used to disable and enable the various threads as appropriate. Masking is avoided when control flow is coherent for the threads of a processor, i.e. they all follow the same path of execution. The masking strategy is what distinguishes SIMT from ordinary SIMD, and has the benefit of inexpensive synchronization between the threads of a processor.

NVIDIA GPUs have the concept of a thread group called a "warp", composed of 32 hardware threads executed in lock-step. The equivalent in AMD GPUs is the "wavefront", although it is composed of 64 hardware threads. In OpenCL, the abstract term covering both warp and wavefront is "sub-group". CUDA also has warp shuffle instructions, which make parallel data exchange within the thread group faster, and OpenCL supports a similar feature through the extension cl_khr_subgroups (see the cl_khr_subgroups(3) manual page).
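The effect of oversubscribing a processor so that memory latency is hidden can be sketched with a toy scheduler model. The Python code below is a simplified illustration under stated assumptions, not a description of real GPU scheduling: every thread group issues one instruction and then waits a fixed number of cycles on memory, switching between groups costs nothing, and the names `simulate_utilization`, `num_groups` and `mem_latency` are invented for this example.

```python
def simulate_utilization(num_groups, mem_latency, total_cycles):
    """Return the fraction of cycles in which some thread group issues an instruction.

    Assumed toy model: after issuing, a group stalls for `mem_latency` cycles,
    and the scheduler can switch to any other ready group at zero cost.
    """
    ready_at = [0] * num_groups  # first cycle at which each group may issue again
    issued = 0
    for cycle in range(total_cycles):
        for g in range(num_groups):
            if ready_at[g] <= cycle:
                # Issue from the first ready group, then make it wait on memory.
                issued += 1
                ready_at[g] = cycle + 1 + mem_latency
                break
        # If no group was ready, this cycle is wasted waiting on memory.
    return issued / total_cycles


for groups in (1, 5, 10, 20):
    print(groups, simulate_utilization(groups, mem_latency=9, total_cycles=1000))
```

With a latency of 9 cycles in this model, a single group keeps the execution hardware busy one cycle in ten, five groups keep it busy half of the time, and ten or more groups keep it busy in every cycle. Real hardware differs in many details, but the sketch shows why having more resident thread groups than execution units lets the processor tolerate slow memory.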

References
