computer engineering Computer engineering (CoE or CpE) is a branch of electrical engineering and computer science that integrates several fields of computer science and electronic engineering required to develop computer hardware and software. Computer engineers n ...

, out-of-order execution (or more formally dynamic execution) is a

paradigm In science and philosophy, a paradigm () is a distinct set of concepts or thought patterns, including theories, research methods, postulates, and standards for what constitute legitimate contributions to a field. Etymology ''Paradigm'' comes f ...

used in most high-performance

central processing unit A central processing unit (CPU), also called a central processor, main processor or just processor, is the electronic circuitry that executes instructions comprising a computer program. The CPU performs basic arithmetic, logic, controlling, a ...

s to make use of instruction cycles that would otherwise be wasted. In this paradigm, a processor executes instructions in an order governed by the availability of input data and execution units, rather than by their original order in a program. In doing so, the processor can avoid being idle while waiting for the preceding instruction to complete and can, in the meantime, process the next instructions that are able to run immediately and independently.

History

Out-of-order execution is a restricted form of data flow computation, which was a major research area in computer architecture in the 1970s and early 1980s. The first machine to use out-of-order execution was the CDC 6600 (1964), designed by

James E. Thornton James E. Thornton (September 25, 1925, Saint Paul, Minnesota–January 11, 2005) was an American computer engineer. Thornton studied electrical engineering at the University of Minnesota earning a bachelor's degree in 1950. Immediately afterwards ...

, which uses a

scoreboard A scoreboard is a large board for publicly displaying the score in a game. Most levels of sport from high school and above use at least one scoreboard for keeping score, measuring time, and displaying statistics. Scoreboards in the past used ...

to avoid conflicts. It permits an instruction to execute if its source operand (read) addresses aren't to be written to by any unexecuted earlier instruction (true dependency) and the destination (write) address not be an address used by any unexecuted earlier instruction (false dependency). The 6600 lacks the means to avoid stalling an execution unit on false dependencies ( write after write (WAW) and write after read (WAR) conflicts, respectively termed "first order conflict" and "third order conflict" by Thornton, who termed true dependencies ( read after write (RAW)) as "second order conflict") because each address has only a single location referable by it. The WAW is worse than WAR for the 6600, because when an execution unit encounters a WAR, the other execution units still receive and execute instructions, but upon a WAW the assignment of instructions to execution units stops, and they can not receive any further instructions until the WAW-causing instruction's destination register has been written to by earlier instruction. About two years later, the IBM System/360 Model 91 (1966) introduced register renaming with Tomasulo's algorithm, which dissolves false dependencies (WAW and WAR), making full out-of-order execution possible. An instruction addressing a write into a register ''r_n'' can be executed before an earlier instruction using the register ''r_n'' is executed, by actually writing into an alternative (renamed) register ''alt-r_n'', which is turned into a normal ("architectural") register ''r_n'' only when all the earlier instructions addressing ''r_n'' have been executed, but until then ''r_n'' is given for earlier instructions and ''alt-r_n'' for later ones addressing ''r_n''. Another advantage the Model 91 has over the 6600 is the ability to execute out-of-order the instructions ''at the same execution unit'', not just between the units like the 6600. This is accomplished by

reservation station A unified reservation station, also known as unified scheduler, is a decentralized feature of the microarchitecture of a CPU that allows for register renaming In computer architecture, register renaming is a technique that abstracts logical r ...

s, from which instructions go to the execution unit when ready, as opposed to the FIFO queue of each execution unit of the 6600. Only the floating-point registers of the Model 91 are renamed, making it subject to the same WAW and WAR limitations as the CDC 6600 when running fixed-point code. The 91 and 6600 both also suffer from imprecise exceptions, which needed to be solved before out-of-order execution would be used outside supercomputers. To have precise exceptions, the proper in-order state of the program's execution must be available upon an exception. By 1985 various approaches were developed as described by James E. Smith and Andrew R. Pleszkun.
(Expanded version published in May 1988 a
''Implementing Precise Interrupts in Pipelined Processors''
) The CDC Cyber 205 was a precursor, as upon a virtual memory interrupt the entire state of the processor (including the information on the partially executed instructions) is saved into an ''invisible exchange package'', so that it can resume at the same state of execution. However to make all exceptions precise, there has to be a way to cancel the effects of instructions. The CDC Cyber 990 (1984) implements precise interrupts by using a history buffer, which holds the old (overwritten) values of registers that are restored when an exception necessitates the reverting of instructions. Smith simulated that adding a reorder buffer (or history buffer or equivalent) to Cray-1S would reduce the performance of executing the first 14

Livermore loops Livermore loops (also known as the Livermore Fortran kernels or LFK) is a Benchmark (computing), benchmark for parallel computing, parallel computers. It was created by Francis H. McMahon from scientific source code run on computers at Lawrence Liv ...

(unvectorized) by only 3%. Important academic research in this subject was led by

Yale Patt Yale Nance Patt is an American professor of electrical and computer engineering at The University of Texas at Austin. He holds the Ernest Cockrell, Jr. Centennial Chair in Engineering. In 1965, Patt introduced the WOS module, the first complex ...

with his HPSm simulator. With the POWER1 (1990) IBM returned to out-of-order execution. It was the first processor to combine register renaming (though again only floating-point registers) with precise exceptions. It uses a ''physical register file'' (i.e. a dynamically remapped file with both uncommitted and committed values) instead of a datafull reorder buffer, but the ability to cancel instructions is needed only in the branch unit, which implements a history buffer (named ''program counter stack'' by IBM) to undo changes to count, link, and condition registers. The reordering capability of even the floating-point instructions is still very limited; due to POWER1's inability to reorder floating-point arithmetic instructions (results became available in-order), their destination registers aren't renamed. POWER1 also doesn't have

s needed for out-of-order use of a same execution unit. The next year IBM's ES/9000 model 900 had register renaming also for the general-purpose registers. It also has

s with six entries for the dual integer unit (each cycle, from the six instructions up to two can be selected and then executed) and six entries for the FPU. Other units have simple FIFO queues. The re-ordering distance is up to 32 instructions. The first

superscalar A superscalar processor is a CPU that implements a form of parallelism called instruction-level parallelism within a single processor. In contrast to a scalar processor, which can execute at most one single instruction per clock cycle, a sup ...

single-chip processors ( Intel i960CA in 1989) used a simple

scoreboarding Scoreboarding is a centralized method, first used in the CDC 6600 computer, for dynamically scheduling instructions so that they can execute out of order when there are no conflicts and the hardware is available. In a scoreboard, the data dependen ...

scheduling like the CDC 6600 had quarter of a century earlier, but in 1992-1996 a rapid advancement of techniques, enabled by increasing transistor counts, saw proliferation down to personal computers. Motorola 88110 (1992) used a history buffer to revert instructions. Loads could be executed ahead of preceding stores. While stores and branches were waiting to start execution, subsequent instructions of other types could keep flowing through all the pipeline stages, including writeback. The 12-entry capacity of the history buffer placed a limit on the reorder distance. PowerPC 601 (1993) was an evolution of the RISC Single Chip, itself a simplification of POWER1. The 601 permitted branch and floating-point instructions to overtake the integer instructions already in the fetched-instruction-queue, the lowest four entries of which were scanned for dispatchability. In the case of a cache miss, loads and stores could be reordered. Only the link and count register could be renamed. In the fall of 1994

NexGen NexGen (Milpitas, California) was a private semiconductor company that designed x86 microprocessors until it was purchased by AMD in 1996. NexGen was a fabless design house that designed its chips but relied on other companies for production ...

and IBM with Motorola brought the renaming of general-purpose registers to single-chip CPUs. NexGen's Nx586 was the first x86 processor capable of out-of-order execution, accomplished with micro-OPs. The re-ordering distance is up to 14 micro-OPs. PowerPC 603 renamed both the general-purpose and FP registers. Each of the four non-branch execution units can have one instruction wait in front of it without blocking the instruction flow to the other units. A five-entry

re-order buffer A re-order buffer (ROB) is a hardware unit used in an extension to the Tomasulo algorithm to support out-of-order and speculative instruction execution. The extension forces instructions to be committed in-order. The buffer is a circular buffe ...

lets no more than four instructions to overtake an unexecuted instruction. Due to a store buffer, a load can access cache ahead of a preceding store. PowerPC 604 (1995) was the first single-chip processor with execution unit-level re-ordering, as three out of its six units each had a two-entry reservation station permitting the newer entry to execute before the older. The re-order buffer capacity is 16 instructions. A four-entry load queue and a six-entry store queue track the re-ordering of loads and stores upon cache misses.

HAL SPARC64 SPARC64 is a microprocessor developed by HAL Computer Systems and fabricated by Fujitsu. It implements the SPARC V9 instruction set architecture (ISA), the first microprocessor to do so. SPARC64 was HAL's first microprocessor and was the first in ...

(1995) exceeded the re-ordering capacity of the ES/9000 model 900 by having three 8-entry reservation stations for integer, floating-point, and

address generation unit The address generation unit (AGU), sometimes also called address computation unit (ACU), is an execution unit inside central processing units (CPUs) that calculates addresses used by the CPU to access main memory. By having address calculations ...

, and a 12-entry reservation station for load/store, which permits greater reordering of cache/memory access than preceding processors. Up to 64 instructions can be in a re-ordered state at a time Pentium Pro (1995) introduced a '' unified reservation station'', which at the 20 micro-OP capacity permitted very flexible re-ordering, backed by a 40-entry re-order buffer. Loads can be re-ordered ahead of both loads and stores. The practically attainable per-cycle rate of execution rose more as full out-of-order execution was further adopted by SGI/ MIPS ( R10000) and HP PA-RISC (

PA-8000 The PA-8000 (PCX-U), code-named ''Onyx'', is a microprocessor developed and fabricated by Hewlett-Packard (HP) that implemented the PA-RISC 2.0 instruction set architecture (ISA). Hunt 1995 It was a completely new design with no circuitry deriv ...

) in 1996. The same year Cyrix 6x86 and AMD K5 brought advanced re-ordering techniques into mainstream

personal computer A personal computer (PC) is a multi-purpose microcomputer whose size, capabilities, and price make it feasible for individual use. Personal computers are intended to be operated directly by an end user, rather than by a computer expert or te ...

s. Since

DEC Alpha Alpha (original name Alpha AXP) is a 64-bit reduced instruction set computer (RISC) instruction set architecture (ISA) developed by Digital Equipment Corporation (DEC). Alpha was designed to replace 32-bit VAX complex instruction set compute ...

gained out-of-order execution in 1998 ( Alpha 21264), the top-performing out-of-order processor cores have been unmatched by in-order cores other than HP/

Intel Intel Corporation is an American multinational corporation and technology company headquartered in Santa Clara, California. It is the world's largest semiconductor chip manufacturer by revenue, and is one of the developers of the x86 ser ...

Itanium 2 and IBM POWER6, though the latter had an out-of-order

floating-point unit In computing, floating-point arithmetic (FP) is arithmetic that represents real numbers approximately, using an integer with a fixed precision, called the significand, scaled by an integer exponent of a fixed base. For example, 12.345 can be ...

. The other high-end in-order processors fell far behind, namely Sun's

UltraSPARC III The UltraSPARC III, code-named "Cheetah", is a microprocessor that implements the SPARC V9 instruction set architecture (ISA) developed by Sun Microsystems and fabricated by Texas Instruments. It was introduced in 2001 and operates at 600 to 90 ...

/ IV, and IBM's mainframes which had lost the out-of-order execution capability for the second time, remaining in-order into the z10 generation. Later big in-order processors were focused on multithreaded performance, but eventually the

SPARC T series The SPARC T-series family of RISC processors and server computers, based on the SPARC V9 architecture, was originally developed by Sun Microsystems, and later by Oracle Corporation after its acquisition of Sun. Its distinguishing feature from earl ...

and Xeon Phi changed to out-of-order execution in 2011 and 2016 respectively. Almost all processors for phones and other lower-end applications remained in-order until c. 2010. First,

Qualcomm Qualcomm () is an American multinational corporation headquartered in San Diego, California, and incorporated in Delaware. It creates semiconductors, software, and services related to wireless technology. It owns patents critical to the 5G, ...

Scorpion Scorpions are predatory arachnids of the order Scorpiones. They have eight legs, and are easily recognized by a pair of grasping pincers and a narrow, segmented tail, often carried in a characteristic forward curve over the back and always en ...

(re-ordering distance of 32) shipped in Snapdragon, and a bit later Arm's A9 succeeded A8. For low-end x86

s in-order early Intel Atoms were first challenged by AMD's

Bobcat The bobcat (''Lynx rufus''), also known as the red lynx, is a medium-sized cat native to North America. It ranges from southern Canada through most of the contiguous United States to Oaxaca in Mexico. It is listed as Least Concern on the ...

, and in 2013 were succeeded by an out-of-order Silvermont. Because the complexity of out-of-order execution precludes achieving the lowest minimum power consumption, cost and size, in-order execution is still prevalent in

microcontroller A microcontroller (MCU for ''microcontroller unit'', often also MC, UC, or μC) is a small computer on a single VLSI integrated circuit (IC) chip. A microcontroller contains one or more CPUs ( processor cores) along with memory and programmabl ...

s and

embedded system An embedded system is a computer system—a combination of a computer processor, computer memory, and input/output peripheral devices—that has a dedicated function within a larger mechanical or electronic system. It is ''embedded ...

s, as well as in phone-class cores such as Arm's A55 and A510 in big.LITTLE configurations.

Basic concept

To appreciate OoO Execution it is useful to first describe in-order, to be able to make a comparison of the two. Instructions cannot be completed instantaneously: they take time (multiple cycles). Therefore, results will lag behind where they are needed. In-order still has to keep track of the dependencies. Its approach is however quite unsophisticated: stall, every time. OoO uses much more sophisticated data tracking techniques, as seen below.

In-order processors

In earlier processors, the processing of instructions is performed in an instruction cycle normally consisting of the following steps: # Instruction fetch. # If input

operand In mathematics, an operand is the object of a mathematical operation, i.e., it is the object or quantity that is operated on. Example The following arithmetic expression shows an example of operators and operands: :3 + 6 = 9 In the above exam ...

s are available (in processor registers, for instance), the instruction is dispatched to the appropriate functional unit. If one or more operands are unavailable during the current clock cycle (generally because they are being fetched from

memory Memory is the faculty of the mind by which data or information is encoded, stored, and retrieved when needed. It is the retention of information over time for the purpose of influencing future action. If past events could not be remember ...

), the processor stalls until they are available. # The instruction is executed by the appropriate functional unit. # The functional unit writes the results back to the

register file A register file is an array of processor registers in a central processing unit (CPU). Register banking is the method of using a single name to access multiple different physical registers depending on the operating mode. Modern integrated circuit- ...

. Often, an in-order processor will have a straightforward " bit vector" which records which registers a pipeline that it will (eventually) write to. If any input operands have the corresponding bit set in this vector, the instruction stalls. Essentially, the vector performs a greatly simplified role of protecting against register hazards. Thus out-of-order execution uses 2D matrices whereas in-order execution uses a 1D vector for hazard avoidance.

Out-of-order processors

This new paradigm breaks up the processing of instructions into these steps: # Instruction fetch. # Instruction dispatch to an instruction queue (also called instruction buffer or

s). # The instruction waits in the queue until its input operands are available. The instruction can leave the queue before older instructions. # The instruction is issued to the appropriate functional unit and executed by that unit. # The results are queued. # Only after all older instructions have their results written back to the register file, then this result is written back to the register file. This is called the graduation or retire stage. The key concept of OoOE processing is to allow the processor to avoid a class of stalls that occur when the data needed to perform an operation are unavailable. In the outline above, the OoOE processor avoids the stall that occurs in step (2) of the in-order processor when the instruction is not completely ready to be processed due to missing data. OoOE processors fill these "slots" in time with other instructions that ''are'' ready, then re-order the results at the end to make it appear that the instructions were processed as normal. The way the instructions are ordered in the original computer code is known as ''program order'', in the processor they are handled in ''data order'', the order in which the data, operands, become available in the processor's registers. Fairly complex circuitry is needed to convert from one ordering to the other and maintain a logical ordering of the output; the processor itself runs the instructions in seemingly random order. The benefit of OoOE processing grows as the instruction pipeline deepens and the speed difference between

main memory Computer data storage is a technology consisting of computer components and recording media that are used to retain digital data. It is a core function and fundamental component of computers. The central processing unit (CPU) of a comput ...

(or cache memory) and the processor widens. On modern machines, the processor runs many times faster than the memory, so during the time an in-order processor spends waiting for data to arrive, it could have processed a large number of instructions.

Dispatch and issue decoupling allows out-of-order issue

One of the differences created by the new paradigm is the creation of queues that allows the dispatch step to be decoupled from the issue step and the graduation stage to be decoupled from the execute stage. An early name for the paradigm was ''decoupled architecture''. In the earlier ''in-order'' processors, these stages operated in a fairly lock-step, pipelined fashion. The instructions of the program may not be run in the originally specified order, as long as the end result is correct. It separates the fetch and decode stages from the execute stage in a pipelined processor by using a buffer. The buffer's purpose is to partition the memory access and execute functions in a computer program and achieve high-performance by exploiting the fine-grain parallelism between the two. In doing so, it effectively hides all memory latency from the processor's perspective. A larger buffer can, in theory, increase throughput. However, if the processor has a branch misprediction then the entire buffer may need to be flushed, wasting a lot of clock cycles and reducing the effectiveness. Furthermore, larger buffers create more heat and use more

die Die, as a verb, refers to death, the cessation of life. Die may also refer to: Games * Die, singular of dice, small throwable objects used for producing random numbers Manufacturing * Die (integrated circuit), a rectangular piece of a semicondu ...

space. For this reason processor designers today favour a multi-threaded design approach. Decoupled architectures are generally thought of as not useful for general purpose computing as they do not handle control intensive code well. Control intensive code include such things as nested branches that occur frequently in

operating system An operating system (OS) is system software that manages computer hardware, software resources, and provides common daemon (computing), services for computer programs. Time-sharing operating systems scheduler (computing), schedule tasks for ef ...

kernels. Decoupled architectures play an important role in scheduling in

very long instruction word Very long instruction word (VLIW) refers to instruction set architectures designed to exploit instruction level parallelism (ILP). Whereas conventional central processing units (CPU, processor) mostly allow programs to specify instructions to exe ...

(VLIW) architectures. To avoid false operand dependencies, which would decrease the frequency when instructions could be issued out of order, a technique called register renaming is used. In this scheme, there are more physical registers than defined by the architecture. The physical registers are tagged so that multiple versions of the same architectural register can exist at the same time.

Execute and writeback decoupling allows program restart

The queue for results is necessary to resolve issues such as branch mispredictions and exceptions/traps. The results queue allows programs to be restarted after an exception, which requires the instructions to be completed in program order. The queue allows results to be discarded due to mispredictions on older branch instructions and exceptions taken on older instructions. The ability to issue instructions past branches that are yet to resolve is known as speculative execution.

Micro-architectural choices

* Are the instructions dispatched to a centralized queue or to multiple distributed queues? : IBM

PowerPC PowerPC (with the backronym Performance Optimization With Enhanced RISC – Performance Computing, sometimes abbreviated as PPC) is a reduced instruction set computer (RISC) instruction set architecture (ISA) created by the 1991 Apple– IBM– ...

processors use queues that are distributed among the different functional units while other out-of-order processors use a centralized queue. IBM uses the term ''reservation stations'' for their distributed queues. * Is there an actual results queue or are the results written directly into a register file? For the latter, the queueing function is handled by register maps that hold the register renaming information for each instruction in flight. :Early Intel out-of-order processors use a results queue called a

, while most later out-of-order processors use register maps. :More precisely: Intel P6 family microprocessors have both a re-order buffer (ROB) and a register alias table (RAT). The ROB was motivated mainly by branch misprediction recovery. :The Intel P6 family is among the earliest OoOE microprocessors but were supplanted by the NetBurst architecture. Years later, Netburst proved to be a dead end due to its long pipeline that assumed the possibility of much higher operating frequencies. Materials were not able to match the design's ambitious clock targets due to thermal issues and later designs based on NetBurst, namely Tejas and Jayhawk, were cancelled. Intel reverted to the P6 design as the basis of the Core and Nehalem microarchitectures. The succeeding Sandy Bridge, Ivy Bridge, and Haswell microarchitectures are a departure from the reordering techniques used in P6 and employ re-ordering techniques from the EV6 and the P4 but with a somewhat shorter pipeline.