In computer engineering, out-of-order execution (or more formally dynamic execution) is a paradigm used in most high-performance central processing units to make use of instruction cycles that would otherwise be wasted. In this paradigm, a processor executes instructions in an order governed by the availability of input data and execution units, rather than by their original order in a program. In doing so, the processor can avoid being idle while waiting for the preceding instruction to complete and can, in the meantime, process the next instructions that are able to run immediately and independently.

History

Out-of-order execution is a restricted form of data flow computation, which was a major research area in computer architecture in the 1970s and early 1980s.

The first machine to use out-of-order execution was the CDC 6600 (1964), designed by James E. Thornton, which uses a scoreboard to avoid conflicts. It permits an instruction to execute if its source operand (read) addresses are not to be written to by any unexecuted earlier instruction (true dependency) and its destination (write) address is not an address used by any unexecuted earlier instruction (false dependency). The 6600 lacks the means to avoid stalling an execution unit on false dependencies (write after write (WAW) and write after read (WAR) conflicts, respectively termed "first order conflict" and "third order conflict" by Thornton, who termed true dependencies (read after write (RAW)) "second order conflict") because each address has only a single location referable by it. WAW is worse than WAR for the 6600: when an execution unit encounters a WAR, the other execution units still receive and execute instructions, but upon a WAW the assignment of instructions to execution units stops, and they cannot receive any further instructions until the WAW-causing instruction's destination register has been written to by the earlier instruction.

About two years later, the
IBM System/360 Model 91 (1966) introduced register renaming with Tomasulo's algorithm, which dissolves false dependencies (WAW and WAR), making full out-of-order execution possible. An instruction that writes into a register ''r_n'' can be executed before an earlier instruction using the register ''r_n'' is executed, by actually writing into an alternative (renamed) register ''alt-r_n'', which is turned into a normal ("architectural") register ''r_n'' only when all the earlier instructions addressing ''r_n'' have been executed; until then, ''r_n'' is given to earlier instructions and ''alt-r_n'' to later ones addressing ''r_n''.

Another advantage the Model 91 has over the 6600 is the ability to execute instructions out of order ''at the same execution unit'', not just between the units like the 6600. This is accomplished by reservation stations, from which instructions go to the execution unit when ready, as opposed to the FIFO queue of each execution unit of the 6600. Only the floating-point registers of the Model 91 are renamed, making it subject to the same WAW and WAR limitations as the CDC 6600 when running fixed-point code. The 91 and 6600 both also suffer from imprecise exceptions, which needed to be solved before out-of-order execution could be used outside supercomputers.

To have precise exceptions, the proper in-order state of the program's execution must be available upon an exception. By 1985 various approaches had been developed, as described by James E. Smith and Andrew R. Pleszkun.
(An expanded version was published in May 1988 as ''Implementing Precise Interrupts in Pipelined Processors''.) The CDC Cyber 205 was a precursor: upon a virtual memory interrupt, the entire state of the processor (including the information on the partially executed instructions) is saved into an ''invisible exchange package'', so that it can resume at the same state of execution. However, to make all exceptions precise, there has to be a way to cancel the effects of instructions. The CDC Cyber 990 (1984) implements precise interrupts by using a history buffer, which holds the old (overwritten) values of registers that are restored when an exception necessitates the reverting of instructions.

Smith's simulations showed that adding a reorder buffer (or history buffer or equivalent) to the Cray-1S would reduce the performance of executing the first 14 Livermore loops (unvectorized) by only 3%. Important academic research in this subject was led by Yale Patt with his HPSm simulator.

With the POWER1 (1990), IBM returned to out-of-order execution. It was the first processor to combine register renaming (though again only for floating-point registers) with precise exceptions. It uses a ''physical register file'' (i.e. a dynamically remapped file with both uncommitted and committed values) instead of a datafull reorder buffer, but the ability to cancel instructions is needed only in the branch unit, which implements a history buffer (named ''program counter stack'' by IBM) to undo changes to the count, link, and condition registers. The reordering capability of even the floating-point instructions is still very limited: because POWER1 cannot reorder floating-point arithmetic instructions (results become available in order), their destination registers are not renamed. POWER1 also lacks the reservation stations needed for out-of-order use of the same execution unit.

The next year, IBM's ES/9000 model 900 had register renaming for the general-purpose registers as well. It also has reservation stations with six entries for the dual integer unit (each cycle, up to two of the six instructions can be selected and then executed) and six entries for the FPU. Other units have simple FIFO queues. The re-ordering distance is up to 32 instructions.

The first
superscalar single-chip processors (Intel i960 CA in 1989) used simple scoreboarding scheduling like the CDC 6600 had a quarter of a century earlier, but in 1992-1996 a rapid advancement of techniques, enabled by increasing transistor counts, saw proliferation down to personal computers. The Motorola 88110 (1992) used a history buffer to revert instructions. Loads could be executed ahead of preceding stores. While stores and branches were waiting to start execution, subsequent instructions of other types could keep flowing through all the pipeline stages, including writeback. The 12-entry capacity of the history buffer placed a limit on the reorder distance.

The PowerPC 601 (1993) was an evolution of the RISC Single Chip, itself a simplification of POWER1. The 601 permitted branch and floating-point instructions to overtake the integer instructions already in the fetched-instruction queue, the lowest four entries of which were scanned for dispatchability. In the case of a cache miss, loads and stores could be reordered. Only the link and count registers could be renamed.

In the fall of 1994, NexGen and IBM with Motorola brought the renaming of general-purpose registers to single-chip CPUs. NexGen's Nx586 was the first x86 processor capable of out-of-order execution, accomplished with micro-OPs. Its re-ordering distance is up to 14 micro-OPs. The PowerPC 603 renamed both the general-purpose and FP registers. Each of the four non-branch execution units can have one instruction wait in front of it without blocking the instruction flow to the other units. A five-entry re-order buffer lets no more than four instructions overtake an unexecuted instruction. Due to a store buffer, a load can access the cache ahead of a preceding store.

The PowerPC 604 (1995) was the first single-chip processor with execution unit-level re-ordering, as three of its six units each had a two-entry reservation station permitting the newer entry to execute before the older. The re-order buffer capacity is 16 instructions. A four-entry load queue and a six-entry store queue track the re-ordering of loads and stores upon cache misses.

The HAL SPARC64 (1995) exceeded the re-ordering capacity of the ES/9000 model 900 by having three 8-entry reservation stations for integer, floating-point, and address generation unit, and a 12-entry reservation station for load/store, which permits greater reordering of cache/memory access than preceding processors. Up to 64 instructions can be in a re-ordered state at a time.

The Pentium Pro (1995) introduced a ''unified reservation station'', which at its 20 micro-OP capacity permitted very flexible re-ordering, backed by a 40-entry re-order buffer. Loads can be re-ordered ahead of both loads and stores.

The practically attainable per-cycle rate of execution rose further as full out-of-order execution was adopted by SGI/MIPS (R10000) and HP PA-RISC (PA-8000) in 1996. The same year, the Cyrix 6x86 and AMD K5 brought advanced re-ordering techniques into mainstream personal computers. Since
DEC Alpha gained out-of-order execution in 1998 (Alpha 21264), the top-performing out-of-order processor cores have been unmatched by in-order cores other than the HP/Intel Itanium 2 and IBM POWER6, though the latter had an out-of-order floating-point unit. The other high-end in-order processors fell far behind, namely Sun's UltraSPARC III/IV and IBM's mainframes, which had lost the out-of-order execution capability for the second time, remaining in-order into the z10 generation. Later big in-order processors were focused on multithreaded performance, but eventually the SPARC T series and Xeon Phi changed to out-of-order execution in 2011 and 2016 respectively.

Almost all processors for phones and other lower-end applications remained in-order until c. 2010. First, Qualcomm's
Scorpion (re-ordering distance of 32) shipped in Snapdragon, and a bit later Arm's A9 succeeded the A8. For low-end x86 personal computers, the in-order early Intel Atoms were first challenged by AMD's Bobcat, and in 2013 were succeeded by the out-of-order Silvermont. Because the complexity of out-of-order execution precludes achieving the lowest minimum power consumption, cost and size, in-order execution is still prevalent in microcontrollers and embedded systems, as well as in phone-class cores such as Arm's A55 and A510 in big.LITTLE configurations.

Basic concept

To appreciate out-of-order (OoO) execution, it is useful to first describe in-order execution, so that the two can be compared. Instructions cannot be completed instantaneously: they take time (multiple cycles). Therefore, results lag behind the point where they are needed. An in-order processor still has to keep track of dependencies, but its approach is quite unsophisticated: stall, every time. OoO execution uses much more sophisticated data tracking techniques, as seen below.

In-order processors

In earlier processors, the processing of instructions is performed in an instruction cycle normally consisting of the following steps:
# Instruction fetch.
# If input operands are available (in processor registers, for instance), the instruction is dispatched to the appropriate functional unit. If one or more operands are unavailable during the current clock cycle (generally because they are being fetched from memory), the processor stalls until they are available.
# The instruction is executed by the appropriate functional unit.
# The functional unit writes the results back to the register file.

Often, an in-order processor will have a straightforward "bit vector" that records which registers the pipeline will (eventually) write to. If any input operand has the corresponding bit set in this vector, the instruction stalls. Essentially, the vector performs a greatly simplified role of protecting against register hazards. Thus out-of-order execution uses 2D matrices whereas in-order execution uses a 1D vector for hazard avoidance.

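The busy-bit vector described above can be sketched in a few lines of Python. This is only an illustrative model, not a description of any particular processor: the function name `run_in_order`, the `(dest, srcs, latency)` tuple format, and the single-issue assumption are all hypothetical. As a further assumption, the sketch also stalls when the destination register has a pending write, which the text does not state.

```python
def run_in_order(program):
    """Model a single-issue in-order pipeline guarded by a busy-bit vector.

    program: list of (dest, srcs, latency) with integer register numbers;
             srcs is a tuple, latency is at least 1 cycle.
    Returns (cycles_used, stall_cycles).
    """
    busy = 0        # bit r set => register r has a write still in flight
    done_at = {}    # register -> cycle at which its pending write lands
    cycle = 0
    stalls = 0
    for dest, srcs, latency in program:
        # Bits this instruction must find clear. Checking dest as well as
        # the sources is an assumption of this sketch.
        needed = 0
        for r in srcs + (dest,):
            needed |= 1 << r
        while True:
            # Clear the bits of writes that have completed by this cycle.
            for r, t in list(done_at.items()):
                if t <= cycle:
                    busy &= ~(1 << r)
                    del done_at[r]
            if busy & needed == 0:
                break
            cycle += 1      # stall: nothing younger may pass this instruction
            stalls += 1
        busy |= 1 << dest
        done_at[dest] = cycle + latency
        cycle += 1
    return cycle, stalls


# r1 <- f(r0) takes 3 cycles; r2 <- f(r1) depends on it; r3 <- f(r0) does not.
demo = [(1, (0,), 3), (2, (1,), 1), (3, (0,), 1)]
print(run_in_order(demo))  # (5, 2)
```

In the demo, the third instruction is independent of the first two, yet it still waits behind the stalled second instruction. That wasted time is what out-of-order execution tries to recover.
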
Out-of-order processors

This new paradigm breaks up the processing of instructions into these steps:
# Instruction fetch.
# Instruction dispatch to an instruction queue (also called instruction buffer or reservation stations).
# The instruction waits in the queue until its input operands are available. The instruction can leave the queue before older instructions.
# The instruction is issued to the appropriate functional unit and executed by that unit.
# The results are queued.
# Only after all older instructions have had their results written back to the register file is this result written back to the register file. This is called the graduation or retire stage.

The key concept of OoOE processing is to allow the processor to avoid a class of stalls that occur when the data needed to perform an operation are unavailable. In the outline above, the OoOE processor avoids the stall that occurs in step (2) of the in-order processor when the instruction is not completely ready to be processed due to missing data.

OoOE processors fill these "slots" in time with other instructions that ''are'' ready, then re-order the results at the end to make it appear that the instructions were processed as normal. The way the instructions are ordered in the original computer code is known as ''program order''; in the processor they are handled in ''data order'', the order in which the data (operands) become available in the processor's registers. Fairly complex circuitry is needed to convert from one ordering to the other and maintain a logical ordering of the output; the processor itself runs the instructions in seemingly random order.

The benefit of OoOE processing grows as the instruction pipeline deepens and the speed difference between main memory (or cache memory) and the processor widens. On modern machines, the processor runs many times faster than the memory, so during the time an in-order processor spends waiting for data to arrive, it could have processed a large number of instructions.
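The out-of-order steps described in this section can be illustrated with a small Python model. It is a rough sketch under stated assumptions rather than a model of any real design: `run_out_of_order`, its `width` parameter, and the `(dest, srcs, latency)` format are hypothetical, only true (read-after-write) dependencies are tracked (as if false dependencies had already been removed), and latencies are at least one cycle.

```python
def run_out_of_order(program, width=1):
    """Issue instructions as their operands become ready; retire in order.

    program: list of (dest, srcs, latency) in program order.
    width:   maximum number of instructions issued per cycle.
    Returns (issue_order, retire_cycles).
    """
    n = len(program)
    # For each instruction, find the older instructions producing its inputs.
    last_writer = {}
    deps = []
    for i, (dest, srcs, _latency) in enumerate(program):
        deps.append([last_writer[r] for r in srcs if r in last_writer])
        last_writer[dest] = i

    done_at = [None] * n    # cycle at which each result becomes available
    issue_order = []
    cycle = 0
    while len(issue_order) < n:
        slots = width
        for i in range(n):              # scan the queue, oldest first
            if slots == 0:
                break
            if done_at[i] is not None:  # already issued
                continue
            if all(done_at[p] is not None and done_at[p] <= cycle
                   for p in deps[i]):
                done_at[i] = cycle + program[i][2]
                issue_order.append(i)
                slots -= 1
        cycle += 1

    # Graduation: a result is written back only after all older ones.
    retire_cycles = []
    for i in range(n):
        previous = retire_cycles[-1] if retire_cycles else 0
        retire_cycles.append(max(done_at[i], previous))
    return issue_order, retire_cycles


demo = [(1, (0,), 3), (2, (1,), 1), (3, (0,), 1)]
print(run_out_of_order(demo))  # ([0, 2, 1], [3, 4, 4])
```

Instruction 2 issues ahead of instruction 1 because its operand is already available, but its result retires no earlier than instruction 1's, so the visible register state still follows program order.
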

Dispatch and issue decoupling allows out-of-order issue

One of the differences created by the new paradigm is the creation of queues that allow the dispatch step to be decoupled from the issue step and the graduation stage to be decoupled from the execute stage. An early name for the paradigm was ''decoupled architecture''. In the earlier ''in-order'' processors, these stages operated in a fairly lock-step, pipelined fashion.

The instructions of the program may not be run in the originally specified order, as long as the end result is correct. A decoupled architecture separates the fetch and decode stages from the execute stage in a pipelined processor by using a buffer. The buffer's purpose is to partition the memory access and execute functions in a computer program and achieve high performance by exploiting the fine-grain parallelism between the two. In doing so, it effectively hides all memory latency from the processor's perspective.

A larger buffer can, in theory, increase throughput. However, if the processor has a branch misprediction, the entire buffer may need to be flushed, wasting a lot of clock cycles and reducing the effectiveness. Furthermore, larger buffers create more heat and use more die space. For this reason processor designers today favour a multi-threaded design approach.

Decoupled architectures are generally thought of as not useful for general-purpose computing as they do not handle control-intensive code well. Control-intensive code includes such things as nested branches, which occur frequently in operating system kernels. Decoupled architectures play an important role in scheduling in very long instruction word (VLIW) architectures.

To avoid false operand dependencies, which would decrease how often instructions could be issued out of order, a technique called register renaming is used. In this scheme, there are more physical registers than defined by the architecture. The physical registers are tagged so that multiple versions of the same architectural register can exist at the same time.
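The idea of keeping several versions of one architectural register alive at once can be sketched as a mapping table plus a free list. The following Python is an illustrative sketch only: the function `rename`, its parameters, and the `(dest, srcs)` format are hypothetical, and physical registers are never returned to the free list, which a real design would do once a value is no longer needed.

```python
def rename(program, num_arch, num_phys):
    """Rewrite architectural register numbers as physical register numbers.

    program: list of (dest, srcs) using architectural registers 0..num_arch-1.
    Returns the same instructions expressed on physical registers.
    """
    mapping = list(range(num_arch))          # architectural -> current physical
    free = list(range(num_arch, num_phys))   # physical registers not yet used
    renamed = []
    for dest, srcs in program:
        # Sources read the newest version of each architectural register.
        phys_srcs = tuple(mapping[r] for r in srcs)
        if not free:
            raise RuntimeError("no free physical register (hardware would stall)")
        # Every write gets a fresh physical register, so it cannot clash with
        # older readers (WAR) or older writers (WAW) of the same name.
        new_reg = free.pop(0)
        mapping[dest] = new_reg
        renamed.append((new_reg, phys_srcs))
    return renamed


# r1 is written twice; the second write is independent of the first.
demo = [(1, (0,)), (2, (1,)), (1, (3,)), (4, (1,))]
print(rename(demo, num_arch=5, num_phys=9))
# [(5, (0,)), (6, (5,)), (7, (3,)), (8, (7,))]
```

After renaming, the two writes to r1 target different physical registers (5 and 7), so the third instruction no longer has to wait for the second one to read the old value.
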

Execute and writeback decoupling allows program restart

The queue for results is necessary to resolve issues such as branch mispredictions and exceptions/traps. The results queue allows programs to be restarted after an exception, which requires the instructions to be completed in program order. The queue allows results to be discarded due to mispredictions on older branch instructions and exceptions taken on older instructions. The ability to issue instructions past branches that are yet to resolve is known as speculative execution.
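A results queue of the kind described here can be sketched as a program-ordered queue whose entries are committed only from the oldest end. The class below is a simplified illustration, not any vendor's design: `ResultsQueue` and its method names are hypothetical, and each entry is identified by a caller-supplied tag.

```python
from collections import deque


class ResultsQueue:
    """Hold completed results until all older instructions have completed."""

    def __init__(self):
        self.entries = deque()   # program order, oldest entry on the left
        self.regs = {}           # committed (architectural) register state

    def allocate(self, tag, dest):
        """Reserve an entry at dispatch time, in program order."""
        self.entries.append({"tag": tag, "dest": dest, "value": None, "done": False})

    def complete(self, tag, value):
        """Record a result; this may happen in any order."""
        for entry in self.entries:
            if entry["tag"] == tag:
                entry["value"], entry["done"] = value, True
                return
        raise KeyError(tag)

    def retire(self):
        """Commit finished entries from the oldest end only."""
        retired = []
        while self.entries and self.entries[0]["done"]:
            entry = self.entries.popleft()
            self.regs[entry["dest"]] = entry["value"]
            retired.append(entry["tag"])
        return retired

    def flush_after(self, tag):
        """Discard every entry younger than `tag`, e.g. after a misprediction."""
        tags = [entry["tag"] for entry in self.entries]
        keep = tags.index(tag) + 1   # raises ValueError for an unknown tag
        while len(self.entries) > keep:
            self.entries.pop()


q = ResultsQueue()
for tag, dest in [(0, "r1"), (1, "r2"), (2, "r3")]:
    q.allocate(tag, dest)
q.complete(2, 30)      # the youngest instruction finishes first
print(q.retire())      # [] because instruction 0 has not finished
q.complete(0, 10)
print(q.retire())      # [0]
q.flush_after(1)       # instruction 2 was on a mispredicted path
q.complete(1, 20)
print(q.retire())      # [1]
print(q.regs)          # {'r1': 10, 'r2': 20}
```

Because the discarded result for r3 never reached `regs`, the committed state is exactly what in-order execution up to instruction 1 would have produced, which is what allows a program to be restarted after an exception.
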

Micro-architectural choices

* Are the instructions dispatched to a centralized queue or to multiple distributed queues?
: IBM PowerPC processors use queues that are distributed among the different functional units while other out-of-order processors use a centralized queue. IBM uses the term ''reservation stations'' for their distributed queues.
* Is there an actual results queue or are the results written directly into a register file? For the latter, the queueing function is handled by register maps that hold the register renaming information for each instruction in flight.
: Early Intel out-of-order processors use a results queue called a re-order buffer, while most later out-of-order processors use register maps.
: More precisely: Intel P6 family microprocessors have both a re-order buffer (ROB) and a register alias table (RAT). The ROB was motivated mainly by branch misprediction recovery.
: The Intel P6 family was among the earliest OoOE microprocessors but was supplanted by the NetBurst architecture. Years later, NetBurst proved to be a dead end due to its long pipeline, which assumed the possibility of much higher operating frequencies. Materials were not able to match the design's ambitious clock targets due to thermal issues, and later designs based on NetBurst, namely Tejas and Jayhawk, were cancelled. Intel reverted to the P6 design as the basis of the Core and Nehalem microarchitectures. The succeeding Sandy Bridge, Ivy Bridge, and Haswell microarchitectures are a departure from the reordering techniques used in P6 and employ re-ordering techniques from the EV6 and the P4 but with a somewhat shorter pipeline.