Out Of Order Execution

In computer engineering, out-of-order execution (or more formally dynamic execution) is an instruction scheduling paradigm used in high-performance central processing units to make use of instruction cycles that would otherwise be wasted. In this paradigm, a processor executes instructions in an order governed by the availability of input data and execution units, rather than by their original order in a program. In doing so, the processor can avoid being idle while waiting for the preceding instruction to complete and can, in the meantime, process the next instructions that are able to run immediately and independently.


History

Out-of-order execution is a restricted form of dataflow architecture, which was a major research area in computer architecture in the 1970s and early 1980s.


Early use in supercomputers

The first machine to use out-of-order execution was the CDC 6600 (1964), designed by James E. Thornton, which uses a scoreboard to avoid conflicts. It permits an instruction to execute if its source operand (read) registers are not due to be written by any unexecuted earlier instruction (true dependency) and its destination (write) register is not used by any unexecuted earlier instruction (false dependency). The 6600 lacks the means to avoid stalling an execution unit on false dependencies (write after write (WAW) and write after read (WAR) conflicts, respectively termed ''first-order conflict'' and ''third-order conflict'' by Thornton, who termed true dependencies (read after write (RAW)) ''second-order conflict'') because each address has only a single location referable by it. WAW is worse than WAR for the 6600: when an execution unit encounters a WAR, the other execution units still receive and execute instructions, but upon a WAW the assignment of instructions to execution units stops, and they cannot receive any further instructions until the WAW-causing instruction's destination register has been written by the earlier instruction.

About two years later, the IBM System/360 Model 91 (1966) introduced register renaming with Tomasulo's algorithm, which dissolves false dependencies (WAW and WAR), making full out-of-order execution possible. An instruction addressing a write into a register ''rn'' can be executed before an earlier instruction using the register ''rn'' is executed, by actually writing into an alternative (renamed) register ''alt-rn'', which is turned into a normal register ''rn'' only when all the earlier instructions addressing ''rn'' have been executed; until then, ''rn'' is given to earlier instructions and ''alt-rn'' to later ones addressing ''rn''. In the Model 91 the register renaming is implemented by a bypass termed the ''Common Data Bus'' (CDB) and memory source operand buffers, leaving the physical architectural registers unused for many cycles, as the oldest state of registers addressed by any unexecuted instruction is found on the CDB.

Another advantage the Model 91 has over the 6600 is the ability to execute instructions out of order in the same execution unit, not just between the units like the 6600. This is accomplished by reservation stations, from which instructions go to the execution unit when ready, as opposed to the FIFO queue of each execution unit of the 6600. The Model 91 is also capable of reordering loads and stores to execute before the preceding loads and stores, unlike the 6600, which only has a limited ability to move loads past loads, and stores past stores, but not loads past stores and stores past loads. Only the floating-point registers of the Model 91 are renamed, making it subject to the same WAW and WAR limitations as the CDC 6600 when running fixed-point calculations. The 91 and 6600 both also suffer from imprecise exceptions, which needed to be solved before out-of-order execution could be applied generally and made practical outside supercomputers.
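The effect of renaming on the three hazard classes can be illustrated with a small sketch. This is not Tomasulo's algorithm or the Model 91's actual mechanism: it is a hypothetical model in which each instruction is a `(destination, sources)` pair, the helper names `hazards` and `rename` are made up for illustration, and every write simply receives a fresh name `p0`, `p1`, and so on (assumed not to clash with any architectural register name).

```python
from itertools import count


def hazards(program):
    """Return the set of (kind, earlier, later) register hazards in a program.

    Each instruction is (destination, sources); indices follow program order.
    """
    found = set()
    for j, (dst_j, srcs_j) in enumerate(program):
        for i in range(j):
            dst_i, srcs_i = program[i]
            # A read depends only on the most recent earlier write.
            overwritten = any(program[k][0] == dst_i for k in range(i + 1, j))
            if dst_i in srcs_j and not overwritten:
                found.add(("RAW", i, j))  # true dependency
            if dst_i == dst_j:
                found.add(("WAW", i, j))  # false dependency
            if dst_j in srcs_i:
                found.add(("WAR", i, j))  # false dependency
    return found


def rename(program):
    """Give every write a fresh register name and redirect later reads to it."""
    fresh = (f"p{n}" for n in count())
    mapping = {}  # architectural name -> most recent renamed register
    renamed = []
    for dst, srcs in program:
        new_srcs = tuple(mapping.get(s, s) for s in srcs)
        mapping[dst] = next(fresh)
        renamed.append((mapping[dst], new_srcs))
    return renamed


program = [
    ("r1", ("r2", "r3")),  # 0: r1 <- r2 op r3
    ("r4", ("r1", "r5")),  # 1: r4 <- r1 op r5   (RAW on r1 from 0)
    ("r1", ("r6", "r7")),  # 2: r1 <- r6 op r7   (WAW with 0, WAR with 1)
    ("r8", ("r1", "r4")),  # 3: r8 <- r1 op r4   (RAW from 2 and from 1)
]

print(sorted(hazards(program)))
print(sorted(hazards(rename(program))))
```

In this model, instruction 2 no longer conflicts with instructions 0 and 1 after renaming, so only the true (RAW) dependencies remain to constrain the execution order.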


Precise exceptions

To have precise exceptions, the proper in-order state of the program's execution must be available upon an exception. By 1985 various approaches had been developed, as described by James E. Smith and Andrew R. Pleszkun. (An expanded version was published in May 1988 as ''Implementing Precise Interrupts in Pipelined Processors''.) The CDC Cyber 205 was a precursor: upon a virtual memory interrupt, the entire state of the processor (including the information on the partially executed instructions) is saved into an ''invisible exchange package'', so that it can resume at the same state of execution. However, to make all exceptions precise, there has to be a way to cancel the effects of instructions. The CDC Cyber 990 (1984) implements precise interrupts by using a history buffer, which holds the old (overwritten) values of registers that are restored when an exception necessitates the reverting of instructions. Through simulation, Smith determined that adding a reorder buffer (or history buffer or equivalent) to the Cray-1S would reduce the performance of executing the first 14 Livermore loops (unvectorized) by only 3%. Important academic research in this subject was led by Yale Patt with his HPSm simulator.

In the 1980s many early RISC microprocessors, like the Motorola 88100, had out-of-order writeback to the registers, resulting in imprecise exceptions. Instructions started execution in order, but some (e.g. floating-point) took more cycles to complete execution. However, the single-cycle execution of the most basic instructions greatly reduced the scope of the problem compared to the CDC 6600.
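The history-buffer idea mentioned above can be sketched in a few lines. This is a toy model, not the Cyber 990's implementation: the class `HistoryBufferMachine`, its method names, and the use of sequence numbers for program order are all assumptions made for illustration, and the sketch further assumes that writes to any one register happen in program order (as on a machine that stalls on WAW conflicts).

```python
class HistoryBufferMachine:
    """Toy register file that logs overwritten values so writes can be undone."""

    def __init__(self, registers):
        self.regs = dict(registers)
        self.history = []  # (sequence number, register, old value), in write order

    def write(self, seq, reg, value):
        """Instruction number `seq` (program order) writes `value` into `reg`."""
        self.history.append((seq, reg, self.regs[reg]))
        self.regs[reg] = value

    def rollback_to(self, faulting_seq):
        """Undo every write made by the faulting instruction or a younger one."""
        kept = []
        for seq, reg, old in reversed(self.history):
            if seq >= faulting_seq:
                self.regs[reg] = old  # restore the overwritten value
            else:
                kept.append((seq, reg, old))
        self.history = kept[::-1]


machine = HistoryBufferMachine({"r1": 10, "r2": 20, "r3": 30})
machine.write(0, "r1", 11)  # instruction 0 completes
machine.write(2, "r3", 33)  # instruction 2 completes before the slower instruction 1
machine.rollback_to(1)      # instruction 1 raises an exception
print(machine.regs)
```

After the rollback the registers hold exactly the state produced by the instructions older than the faulting one, which is what a precise exception requires.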


Decoupling

Smith also researched how to make different execution units operate more independently of each other and of the memory, front-end, and branching. He implemented those ideas in the Astronautics ZS-1 (1988), featuring a decoupling of the integer/load/store pipeline from the floating-point pipeline, allowing inter-pipeline reordering. The ZS-1 was also capable of executing loads ahead of preceding stores. In his 1984 paper, he opined that enforcing the precise exceptions only on the integer/memory pipeline should be sufficient for many use cases, as it even permits virtual memory. Each pipeline had an instruction buffer to decouple it from the instruction decoder, to prevent the stalling of the front end. To further decouple the memory access from execution, each of the two pipelines was associated with two addressable queues that effectively performed limited register renaming. A similar decoupled architecture had been used a bit earlier in the Culler 7. The ZS-1's ISA, like IBM's subsequent POWER, aided the early execution of branches.


Research comes to fruition

With the POWER1 (1990), IBM returned to out-of-order execution. It was the first processor to combine register renaming (though again only floating-point registers) with precise exceptions. It uses a ''physical register file'' (i.e. a dynamically remapped file with both uncommitted and committed values) instead of a reorder buffer, but the ability to cancel instructions is needed only in the branch unit, which implements a history buffer (named ''program counter stack'' by IBM) to undo changes to count, link, and condition registers. The reordering capability of even the floating-point instructions is still very limited; due to POWER1's inability to reorder floating-point arithmetic instructions (results became available in-order), their destination registers aren't renamed. POWER1 also doesn't have the reservation stations needed for out-of-order use of the same execution unit. The next year IBM's ES/9000 model 900 had register renaming added for the general-purpose registers. It also has reservation stations with six entries for the dual integer unit (each cycle, up to two of the six instructions can be selected and then executed) and six entries for the FPU. Other units have simple FIFO queues. The reordering distance is up to 32 instructions. The A19 of Unisys' A-series of mainframes was also released in 1991 and was claimed to have out-of-order execution, and one analyst called the A19's technology three to five years ahead of the competition.


Wide adoption

The first superscalar single-chip processors (Intel i960CA in 1989) used simple scoreboard scheduling, as the CDC 6600 had a quarter of a century earlier. In 1992–1996 a rapid advancement of techniques, enabled by increasing transistor counts, saw proliferation down to personal computers. The Motorola 88110 (1992) used a history buffer to revert instructions. Loads could be executed ahead of preceding stores. While stores and branches were waiting to start execution, subsequent instructions of other types could keep flowing through all the pipeline stages, including writeback. The 12-entry capacity of the history buffer placed a limit on the reorder distance. The PowerPC 601 (1993) was an evolution of the RISC Single Chip, itself a simplification of POWER1. The 601 permitted branch and floating-point instructions to overtake the integer instructions already in the fetched instruction queue, the lowest four entries of which were scanned for dispatchability. In the case of a cache miss, loads and stores could be reordered. Only the link and count registers could be renamed. In the fall of 1994 NexGen and IBM with Motorola brought the renaming of general-purpose registers to single-chip CPUs. NexGen's Nx586 was the first x86 processor capable of out-of-order execution and featured a reordering distance of up to 14 micro-operations. The PowerPC 603 renamed both the general-purpose and FP registers. Each of the four non-branch execution units can have one instruction wait in front of it without blocking the instruction flow to the other units. A five-entry reorder buffer lets no more than four instructions overtake an unexecuted instruction. Due to a store buffer, a load can access cache ahead of a preceding store. The PowerPC 604 (1995) was the first single-chip processor with execution unit-level reordering, as three out of its six units each had a two-entry reservation station permitting the newer entry to execute before the older. The reorder buffer capacity is 16 instructions. A four-entry load queue and a six-entry store queue track the reordering of loads and stores upon cache misses.
HAL SPARC64 (1995) exceeded the reordering capacity of the ES/9000 model 900 by having three 8-entry reservation stations for integer, floating-point, and address generation unit, and a 12-entry reservation station for load/store, which permits greater reordering of cache/memory access than preceding processors. Up to 64 instructions can be in a reordered state at a time.
Pentium Pro (1995) introduced a ''unified reservation station'', which at its 20 micro-op capacity permitted very flexible reordering, backed by a 40-entry reorder buffer. Loads can be reordered ahead of both loads and stores. The practically attainable per-cycle rate of execution rose further as full out-of-order execution was further adopted by SGI/MIPS (R10000) and HP PA-RISC (PA-8000) in 1996. The same year Cyrix 6x86 and AMD K5 brought advanced reordering techniques into mainstream personal computers.

Since DEC Alpha gained out-of-order execution in 1998 (Alpha 21264), the top-performing out-of-order processor cores have been unmatched by in-order cores other than HP/Intel Itanium 2 and IBM POWER6, though the latter had an out-of-order floating-point unit. The other high-end in-order processors fell far behind, namely Sun's UltraSPARC III/IV, and IBM's mainframes, which had lost the out-of-order execution capability for the second time, remaining in-order into the z10 generation. Later big in-order processors were focused on multithreaded performance, but eventually the SPARC T series and Xeon Phi changed to out-of-order execution in 2011 and 2016 respectively.

Almost all processors for phones and other lower-end applications remained in-order for longer. First, Qualcomm's Scorpion (reordering distance of 32) shipped in Snapdragon, and a bit later Arm's out-of-order A9 succeeded the in-order A8. For low-end x86 personal computers, the in-order Bonnell microarchitecture in early Intel Atom processors was first challenged by AMD's Bobcat microarchitecture, and in 2013 was succeeded by the out-of-order Silvermont microarchitecture. Because the complexity of out-of-order execution precludes achieving the lowest power consumption, cost and size, in-order execution is still prevalent in microcontrollers and embedded systems, as well as in phone-class cores such as Arm's A55 and A510 in big.LITTLE configurations.


Basic concept


Background

Out-of-order execution is more sophisticated than the baseline of in-order execution. In pipelined in-order processors, the execution of instructions overlaps in pipelined fashion, with each instruction requiring multiple clock cycles to complete. The consequence is that results from a previous instruction will lag behind where they may be needed in the next. In-order execution still has to keep track of these dependencies. Its approach is, however, quite unsophisticated: stall, every time. Out-of-order execution uses much more sophisticated data tracking techniques, as described below.


In-order processors

In earlier processors, the processing of instructions is performed in an instruction cycle normally consisting of the following steps:
# Instruction fetch.
# If input operands are available (in processor registers, for instance), the instruction is dispatched to the appropriate functional unit. If one or more operands are unavailable during the current clock cycle (generally because they must be fetched from memory), the processor stalls until they are available.
# The instruction is executed by the appropriate functional unit.
# The functional unit writes the results back to the register file.
Often, an in-order processor has a bit vector recording which registers will be written to by a pipeline. If any input operands have the corresponding bit set in this vector, the instruction stalls. Essentially, the vector performs a greatly simplified role of protecting against register hazards. Thus out-of-order execution uses 2D matrices whereas in-order execution uses a 1D vector for hazard avoidance.
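The bit-vector check described above can be sketched as a tiny cycle-by-cycle model. The function names `mask` and `run_in_order`, the `(op, dst, srcs)` instruction encoding, the latencies, and the decision to also stall when the destination register has a write pending are assumptions of this sketch, not a description of any particular processor.

```python
def mask(registers):
    """Build a bit vector with one bit set per register number."""
    bits = 0
    for r in registers:
        bits |= 1 << r
    return bits


def run_in_order(program, latency):
    """Issue at most one instruction per cycle, strictly in program order.

    Returns (total cycles, stall cycles). Instructions are (op, dst, srcs).
    """
    pending = 0     # bit r set => register r has a write in flight
    in_flight = []  # [cycles left, destination register]
    pc = cycle = stalls = 0
    while pc < len(program) or in_flight:
        cycle += 1
        # Complete finished writes and clear their pending bits.
        for entry in in_flight:
            entry[0] -= 1
        for left, dst in in_flight:
            if left == 0:
                pending &= ~(1 << dst)
        in_flight = [e for e in in_flight if e[0] > 0]
        if pc < len(program):
            op, dst, srcs = program[pc]
            if pending & mask(srcs + (dst,)):
                stalls += 1  # hazard: wait, and block everything behind
            else:
                pending |= 1 << dst
                in_flight.append([latency[op], dst])
                pc += 1
    return cycle, stalls


latency = {"load": 3, "add": 1}
program = [
    ("load", 1, (2,)),   # r1 <- memory[r2]
    ("add", 3, (1, 4)),  # r3 <- r1 + r4   (needs the loaded value)
    ("add", 5, (6, 7)),  # r5 <- r6 + r7   (independent, but must wait its turn)
]
print(run_in_order(program, latency))
```

In this model the third instruction is ready from the start, yet it cannot issue until the stalled second instruction has gone ahead of it.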


Out-of-order processors

This new paradigm breaks up the processing of instructions into these steps: # Instruction fetch. # Instruction decoding. # Instruction renaming. # Instruction dispatch to an instruction queue (also called instruction buffer or
reservation stations).
# The instruction waits in the queue until its input operands are available. The instruction can leave the queue before older instructions.
# The instruction is issued to the appropriate functional unit and executed by that unit.
# The results are queued.
# Only after all older instructions have had their results written back to the register file is this result written back to the register file. This is called the graduation or retire stage.

The key concept of out-of-order processing is to allow the processor to avoid a class of stalls that occur when the data needed to perform an operation are unavailable. In the outline above, the processor avoids the stall that occurs in step 2 of the in-order processor when the instruction is not completely ready to be processed due to missing data.

Out-of-order processors fill these ''slots'' in time with other instructions that ''are'' ready, then reorder the results at the end to make it appear that the instructions were processed as normal. The way the instructions are ordered in the original computer code is known as ''program order''; in the processor they are handled in ''data order'', the order in which the data becomes available in the processor's registers. Fairly complex circuitry is needed to convert from one ordering to the other and maintain a logical ordering of the output.

The benefit of out-of-order processing grows as the instruction pipeline deepens and the speed difference between main memory (or cache memory) and the processor widens. On modern machines, the processor runs many times faster than the memory, so during the time an in-order processor spends waiting for data to arrive, it could theoretically have processed a large number of instructions.
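
The stall-avoidance argument above can be illustrated with a toy model. The following Python sketch is not a description of any real processor: the names <code>Instr</code> and <code>run</code>, the one-issue-per-cycle limit, the latencies, and the assumption that every register is written at most once are all chosen for illustration. It compares issuing strictly in program order with issuing the oldest instruction whose operands are ready.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str
    dest: str
    srcs: tuple
    latency: int  # cycles from issue until the result is available


def run(program, out_of_order):
    """Return (issue_order, finish_cycle) for a one-issue-per-cycle toy model.

    Assumption: each destination register is written by only one instruction.
    """
    # Results of instructions that have not issued yet are "never ready" so far;
    # registers that no instruction writes are treated as ready from cycle 0.
    ready_at = {ins.dest: float("inf") for ins in program}
    pending = list(program)  # not-yet-issued instructions, in program order
    issue_order = []
    cycle = 0
    finish = 0
    while pending:
        # In-order issue may only look at the oldest waiting instruction;
        # out-of-order issue may look at every waiting instruction.
        candidates = pending if out_of_order else pending[:1]
        for ins in candidates:
            if all(ready_at.get(s, 0) <= cycle for s in ins.srcs):
                ready_at[ins.dest] = cycle + ins.latency
                finish = max(finish, cycle + ins.latency)
                issue_order.append(ins.name)
                pending.remove(ins)
                break  # at most one instruction issues per cycle
        cycle += 1
    return issue_order, finish


program = [
    Instr("load", "r1", ("r0",), 5),  # slow memory access
    Instr("add", "r2", ("r1",), 1),   # needs the loaded value
    Instr("mul", "r3", ("r4",), 1),   # independent of the load
    Instr("sub", "r5", ("r3",), 1),   # needs the result of mul
]

print(run(program, out_of_order=False))
print(run(program, out_of_order=True))
```

In this model the in-order run leaves the issue slot empty while <code>add</code> waits for the load, whereas the out-of-order run uses those cycles for <code>mul</code> and <code>sub</code> and finishes earlier. The size of the gap depends entirely on the assumed latencies.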


Dispatch and issue decoupling allows out-of-order issue

One difference introduced by the new paradigm is the use of queues that allow the dispatch step to be decoupled from the issue step and the graduation stage to be decoupled from the execute stage. An early name for the paradigm was ''decoupled architecture''. In the earlier ''in-order'' processors, these stages operated in a fairly lock-step, pipelined fashion.

The fetch and decode stages are separated from the execute stage in a pipelined processor by using a buffer. The buffer's purpose is to partition the memory access and execute functions in a computer program and achieve high performance by exploiting the fine-grain parallelism between the two. In doing so, it effectively hides all memory latency from the processor's perspective. A larger buffer can, in theory, increase throughput. However, if the processor has a branch misprediction then the entire buffer may need to be flushed, wasting many clock cycles and reducing the buffer's effectiveness. Furthermore, larger buffers create more heat and use more die space. For this reason processor designers today favor a multi-threaded design approach.

Decoupled architectures are generally thought of as not useful for general-purpose computing as they do not handle control-intensive code well. Control-intensive code includes such things as nested branches that occur frequently in operating system kernels. Decoupled architectures play an important role in scheduling in very long instruction word (VLIW) architectures.


Execute and writeback decoupling allows program restart

The queue for results is necessary to resolve issues such as branch mispredictions and exceptions. The results queue allows programs to be restarted after an exception and allows instructions to be completed in program order. It also allows results to be discarded when an older branch instruction was mispredicted or an older instruction took an exception. The ability to issue instructions past branches that have yet to be resolved is known as speculative execution.
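
As a rough illustration of the results queue, the Python sketch below models entries that are allocated in program order, completed in any order, and retired strictly in program order. The class name <code>ResultsQueue</code>, its methods, and the use of a <code>fault</code> flag to stand for either an exception or a misprediction are assumptions made for this example, not the design of any particular processor.

```python
from collections import deque


class ResultsQueue:
    """Toy results queue: allocate in program order, complete in any order,
    retire strictly in program order."""

    def __init__(self):
        self.entries = deque()  # oldest instruction on the left
        self.registers = {}     # committed (architectural) register state

    def allocate(self, tag, dest):
        self.entries.append({"tag": tag, "dest": dest, "value": None,
                             "done": False, "fault": False})

    def complete(self, tag, value=None, fault=False):
        """Record the outcome of an instruction that has finished executing."""
        for entry in self.entries:
            if entry["tag"] == tag:
                entry.update(value=value, done=True, fault=fault)
                return
        raise KeyError(tag)

    def retire(self):
        """Commit finished instructions from the head of the queue and stop at
        the first unfinished one. If the head faulted, discard it and all
        younger entries and return its tag so execution can restart there."""
        while self.entries and self.entries[0]["done"]:
            head = self.entries[0]
            if head["fault"]:
                self.entries.clear()
                return head["tag"]
            self.registers[head["dest"]] = head["value"]
            self.entries.popleft()
        return None


q = ResultsQueue()
for tag, dest in [("i0", "r1"), ("i1", "r2"), ("i2", "r3")]:
    q.allocate(tag, dest)

q.complete("i2", 30)           # the youngest instruction finishes first
q.retire()                     # nothing commits: i0 is still outstanding
q.complete("i0", 10)
q.retire()                     # commits i0 only, since i1 is not finished
q.complete("i1", fault=True)
restart_at = q.retire()        # i1 faulted: i1 and i2 are discarded
print(restart_at, q.registers)
```

The result computed early by <code>i2</code> never reaches the committed register state, because an older instruction faulted before <code>i2</code> could retire.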


Micro-architectural choices

Are the instructions dispatched to a centralized queue or to multiple distributed queues?
:IBM PowerPC processors use queues that are distributed among the different functional units while other out-of-order processors use a centralized queue. IBM uses the term ''reservation stations'' for their distributed queues.

Is there an actual results queue or are the results written directly into a register file? For the latter, the queueing function is handled by register maps that hold the register renaming information for each instruction in flight.
:Early Intel out-of-order processors use a results queue called a reorder buffer, while most later out-of-order processors use register maps.
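
The register-map alternative can be sketched as follows. This is a simplified illustration only: the class <code>RenameMap</code>, its free-list policy, and the method names are hypothetical, and real designs differ in how they checkpoint and recover the map.

```python
class RenameMap:
    """Toy register map: architectural registers are mapped to physical
    registers, and each in-flight instruction remembers the mapping it
    replaced so that it can later be committed or undone."""

    def __init__(self, arch_regs, num_physical):
        # Initially, architectural register i lives in physical register i.
        self.map = {reg: i for i, reg in enumerate(arch_regs)}
        self.free = list(range(len(arch_regs), num_physical))
        self.in_flight = []  # (arch_reg, old_phys, new_phys), oldest first

    def rename(self, dest, srcs):
        """Return (new_phys_dest, phys_srcs) for one instruction."""
        phys_srcs = [self.map[s] for s in srcs]  # read sources before remapping
        new_phys = self.free.pop(0)              # IndexError if none are free
        self.in_flight.append((dest, self.map[dest], new_phys))
        self.map[dest] = new_phys
        return new_phys, phys_srcs

    def retire_oldest(self):
        """The oldest instruction commits, so the physical register that held
        the previous value of its destination can be reused."""
        _, old_phys, _ = self.in_flight.pop(0)
        self.free.append(old_phys)

    def squash_youngest(self):
        """Undo the youngest instruction, for example after a misprediction."""
        dest, old_phys, new_phys = self.in_flight.pop()
        self.map[dest] = old_phys
        self.free.insert(0, new_phys)


m = RenameMap(["r1", "r2"], num_physical=4)
print(m.rename("r1", ["r1", "r2"]))  # first write to r1 gets a fresh register
print(m.rename("r1", ["r1"]))        # second write reads the first one's result
m.squash_youngest()                  # discard the second write
m.retire_oldest()                    # commit the first write
print(m.map, m.free)
```

In this sketch the results are written directly into physical registers, and the ordering information lives in the list of in-flight mappings rather than in a separate queue of result values.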


See also

* Memory barrier
* Replay system
* Shelving buffer

