In computer engineering, out-of-order execution (or more formally dynamic execution) is a paradigm used in most high-performance central processing units to make use of instruction cycles that would otherwise be wasted. In this paradigm, a processor executes instructions in an order governed by the availability of input data and execution units, rather than by their original order in a program.
In doing so, the processor can avoid being idle while waiting for the preceding instruction to complete and can, in the meantime, process the next instructions that are able to run immediately and independently.
History
Out-of-order execution is a restricted form of data flow computation, which was a major research area in computer architecture in the 1970s and early 1980s.
The first machine to use out-of-order execution was the CDC 6600 (1964), designed by James E. Thornton, which uses a scoreboard to avoid conflicts. It permits an instruction to execute if its source operand (read) addresses are not to be written to by any unexecuted earlier instruction (true dependency) and its destination (write) address is not an address used by any unexecuted earlier instruction (false dependency). The 6600 lacks the means to avoid stalling an execution unit on false dependencies (write after write (WAW) and write after read (WAR) conflicts, respectively termed "first order conflict" and "third order conflict" by Thornton, who termed true dependencies (read after write (RAW)) "second order conflict") because each address has only a single location referable by it. WAW is worse than WAR for the 6600: when an execution unit encounters a WAR, the other execution units still receive and execute instructions, but upon a WAW the assignment of instructions to execution units stops, and no unit can receive further instructions until the WAW-causing instruction's destination register has been written to by the earlier instruction.
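The three conflict types named above can be illustrated with a small sketch. The `Instr` type and the `classify_hazards` function are hypothetical names introduced here for illustration; they classify the dependency between two instructions and do not model the 6600's actual scoreboard hardware.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    """Hypothetical instruction: one destination register, any number of sources."""
    dest: str
    srcs: tuple


def classify_hazards(earlier: Instr, later: Instr) -> set:
    """Return the conflicts that `later` has against an unexecuted `earlier`."""
    hazards = set()
    if earlier.dest in later.srcs:
        hazards.add("RAW")  # true dependency: later reads what earlier writes
    if later.dest in earlier.srcs:
        hazards.add("WAR")  # false dependency: later overwrites an input of earlier
    if later.dest == earlier.dest:
        hazards.add("WAW")  # false dependency: both write the same register
    return hazards
```

Only the RAW case reflects a real flow of data; the other two arise purely from reusing a register name.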
About two years later, the IBM System/360 Model 91 (1966) introduced register renaming with Tomasulo's algorithm, which dissolves false dependencies (WAW and WAR), making full out-of-order execution possible. An instruction addressing a write into a register ''rn'' can be executed before an earlier instruction using the register ''rn'' is executed, by actually writing into an alternative (renamed) register ''alt-rn'', which is turned into a normal ("architectural") register ''rn'' only when all the earlier instructions addressing ''rn'' have been executed; until then, ''rn'' is given to earlier instructions and ''alt-rn'' to later ones addressing ''rn''. Another advantage the Model 91 has over the 6600 is the ability to execute instructions out of order ''at the same execution unit'', not just between the units like the 6600. This is accomplished by reservation stations, from which instructions go to the execution unit when ready, as opposed to the FIFO queue of each execution unit of the 6600. Only the floating-point registers of the Model 91 are renamed, making it subject to the same WAW and WAR limitations as the CDC 6600 when running fixed-point code. The 91 and 6600 both also suffer from imprecise exceptions, which needed to be solved before out-of-order execution could be used outside supercomputers.
To have precise exceptions, the proper in-order state of the program's execution must be available upon an exception. By 1985 various approaches had been developed, as described by James E. Smith and Andrew R. Pleszkun. (Expanded version published in May 1988 as ''Implementing Precise Interrupts in Pipelined Processors''.) The CDC Cyber 205 was a precursor, as upon a virtual memory interrupt the entire state of the processor (including the information on the partially executed instructions) is saved into an ''invisible exchange package'', so that it can resume at the same state of execution. However, to make all exceptions precise, there has to be a way to cancel the effects of instructions. The CDC Cyber 990 (1984) implements precise interrupts by using a history buffer, which holds the old (overwritten) values of registers that are restored when an exception necessitates the reverting of instructions.
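The history-buffer idea can be sketched as follows. The class and method names are hypothetical, and the model is far simpler than the Cyber 990's hardware: it assumes register writes are logged in program order, saves the old value before each write, and restores the saved values newest-first on a rollback.

```python
class HistoryBufferCPU:
    """Toy register file with a history buffer (illustrative names only)."""

    def __init__(self, regs):
        self.regs = dict(regs)
        self.history = []  # (sequence number, register, old value), oldest first

    def write(self, seq, reg, value):
        # Save the value being overwritten before updating the register.
        self.history.append((seq, reg, self.regs[reg]))
        self.regs[reg] = value

    def retire_up_to(self, seq):
        # Instructions up to `seq` can no longer be reverted; drop their entries.
        self.history = [h for h in self.history if h[0] > seq]

    def rollback_after(self, seq):
        # Undo, newest first, every write made by instructions younger than `seq`.
        while self.history and self.history[-1][0] > seq:
            _, reg, old = self.history.pop()
            self.regs[reg] = old
```

If instruction 1 raises an exception after instructions 2 and 3 have already written registers, `rollback_after(1)` returns the register file to the state it had immediately after instruction 1.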
Smith's simulations indicated that adding a reorder buffer (or history buffer or equivalent) to the Cray-1S would reduce the performance of executing the first 14 Livermore loops (unvectorized) by only 3%.
Important academic research in this subject was led by Yale Patt with his HPSm simulator.
With the POWER1 (1990), IBM returned to out-of-order execution. It was the first processor to combine register renaming (though again only floating-point registers) with precise exceptions. It uses a ''physical register file'' (i.e. a dynamically remapped file with both uncommitted and committed values) instead of a reorder buffer that holds data, but the ability to cancel instructions is needed only in the branch unit, which implements a history buffer (named ''program counter stack'' by IBM) to undo changes to count, link, and condition registers. The reordering capability of even the floating-point instructions is still very limited; due to POWER1's inability to reorder floating-point arithmetic instructions (results became available in-order), their destination registers aren't renamed. POWER1 also doesn't have the reservation stations needed for out-of-order use of the same execution unit. The next year IBM's ES/9000 model 900 had register renaming also for the general-purpose registers. It also has reservation stations with six entries for the dual integer unit (each cycle, up to two of the six instructions can be selected and then executed) and six entries for the FPU. Other units have simple FIFO queues. The re-ordering distance is up to 32 instructions.
The first superscalar single-chip processors (Intel i960 CA in 1989) used simple scoreboarding scheduling like the CDC 6600 had a quarter of a century earlier, but in 1992-1996 a rapid advancement of techniques, enabled by increasing transistor counts, saw proliferation down to personal computers.
Motorola 88110 (1992) used a history buffer to revert instructions. Loads could be executed ahead of preceding stores. While stores and branches were waiting to start execution, subsequent instructions of other types could keep flowing through all the pipeline stages, including writeback. The 12-entry capacity of the history buffer placed a limit on the reorder distance.
PowerPC 601 (1993) was an evolution of the RISC Single Chip, itself a simplification of POWER1. The 601 permitted branch and floating-point instructions to overtake the integer instructions already in the fetched-instruction-queue, the lowest four entries of which were scanned for dispatchability. In the case of a cache miss, loads and stores could be reordered. Only the link and count register could be renamed. In the fall of 1994 NexGen and IBM with Motorola brought the renaming of general-purpose registers to single-chip CPUs. NexGen's Nx586 was the first x86 processor capable of out-of-order execution, accomplished with micro-OPs. The re-ordering distance is up to 14 micro-OPs.
PowerPC 603 renamed both the general-purpose and FP registers. Each of the four non-branch execution units can have one instruction wait in front of it without blocking the instruction flow to the other units. A five-entry re-order buffer lets no more than four instructions overtake an unexecuted instruction. Due to a store buffer, a load can access cache ahead of a preceding store.
PowerPC 604 (1995) was the first single-chip processor with execution unit-level re-ordering, as three out of its six units each had a two-entry reservation station permitting the newer entry to execute before the older. The re-order buffer capacity is 16 instructions. A four-entry load queue and a six-entry store queue track the re-ordering of loads and stores upon cache misses.
HAL SPARC64 (1995) exceeded the re-ordering capacity of the ES/9000 model 900 by having three 8-entry reservation stations for the integer, floating-point, and address generation units, and a 12-entry reservation station for load/store, which permits greater reordering of cache/memory access than preceding processors. Up to 64 instructions can be in a re-ordered state at a time.
Pentium Pro (1995) introduced a ''unified reservation station'', which at the 20 micro-OP capacity permitted very flexible re-ordering, backed by a 40-entry re-order buffer. Loads can be re-ordered ahead of both loads and stores.
The practically attainable per-cycle rate of execution rose more as full out-of-order execution was further adopted by SGI/MIPS (R10000) and HP PA-RISC (PA-8000) in 1996. The same year Cyrix 6x86 and AMD K5 brought advanced re-ordering techniques into mainstream personal computers. Since DEC Alpha gained out-of-order execution in 1998 (Alpha 21264), the top-performing out-of-order processor cores have been unmatched by in-order cores other than HP/Intel Itanium 2 and IBM POWER6, though the latter had an out-of-order floating-point unit. The other high-end in-order processors fell far behind, namely Sun's UltraSPARC III/IV, and IBM's mainframes which had lost the out-of-order execution capability for the second time, remaining in-order into the z10 generation. Later big in-order processors were focused on multithreaded performance, but eventually the SPARC T series and Xeon Phi changed to out-of-order execution in 2011 and 2016 respectively.
Almost all processors for phones and other lower-end applications remained in-order until c. 2010. First, Qualcomm's Scorpion (re-ordering distance of 32) shipped in Snapdragon, and a bit later Arm's A9 succeeded A8. For low-end x86 personal computers, the in-order early Intel Atoms were first challenged by AMD's Bobcat, and in 2013 were succeeded by an out-of-order Silvermont. Because the complexity of out-of-order execution precludes achieving the lowest minimum power consumption, cost and size, in-order execution is still prevalent in microcontrollers and embedded systems, as well as in phone-class cores such as Arm's A55 and A510 in big.LITTLE configurations.
Basic concept
To appreciate out-of-order (OoO) execution, it is useful to first describe in-order execution so that the two can be compared. Instructions cannot be completed instantaneously: they take time (multiple cycles). Therefore, results lag behind the point where they are needed. An in-order processor still has to keep track of dependencies, but its approach is unsophisticated: stall, every time. OoO uses much more sophisticated data-tracking techniques, as seen below.
In-order processors
In earlier processors, the processing of instructions is performed in an instruction cycle normally consisting of the following steps:
# Instruction fetch.
# If input operands are available (in processor registers, for instance), the instruction is dispatched to the appropriate functional unit. If one or more operands are unavailable during the current clock cycle (generally because they are being fetched from memory), the processor stalls until they are available.
# The instruction is executed by the appropriate functional unit.
# The functional unit writes the results back to the register file.
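The stall condition in step 2 can be sketched as follows. This is a minimal illustration with hypothetical function names, assuming registers are numbered from 0 and that an integer bitmask records which registers still have a write in flight.

```python
def must_stall(pending_writes: int, src_regs) -> bool:
    """True if any source register still has a write in flight."""
    return any((pending_writes >> r) & 1 for r in src_regs)


def mark_pending(pending_writes: int, dest_reg: int) -> int:
    """Set the bit for a register that a dispatched instruction will write."""
    return pending_writes | (1 << dest_reg)


def clear_pending(pending_writes: int, dest_reg: int) -> int:
    """Clear the bit once the result reaches the register file."""
    return pending_writes & ~(1 << dest_reg)
```

An instruction that reads register 3 is held back for as long as bit 3 of the mask is set, and everything behind it waits as well.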
Often, an in-order processor will have a straightforward "bit vector" that records which registers a pipeline will (eventually) write to. If any input operands have the corresponding bit set in this vector, the instruction stalls. Essentially, the vector performs a greatly simplified role of protecting against register hazards. Thus out-of-order execution uses 2D matrices whereas in-order execution uses a 1D vector for hazard avoidance.
Out-of-order processors
This new paradigm breaks up the processing of instructions into these steps:
# Instruction fetch.
# Instruction dispatch to an instruction queue (also called instruction buffer or reservation stations).
# The instruction waits in the queue until its input operands are available. The instruction can leave the queue before older instructions.
# The instruction is issued to the appropriate functional unit and executed by that unit.
# The results are queued.
# Only after all older instructions have had their results written back to the register file is this result written back to the register file. This is called the graduation or retire stage.
The key concept of OoOE processing is to allow the processor to avoid a class of stalls that occur when the data needed to perform an operation are unavailable. In the outline above, the OoOE processor avoids the stall that occurs in step (2) of the in-order processor when the instruction is not completely ready to be processed due to missing data.
OoOE processors fill these "slots" in time with other instructions that ''are'' ready, then re-order the results at the end to make it appear that the instructions were processed as normal. The way the instructions are ordered in the original computer code is known as ''program order''; in the processor they are handled in ''data order'', the order in which the data (the operands) become available in the processor's registers. Fairly complex circuitry is needed to convert from one ordering to the other and maintain a logical ordering of the output; the processor itself runs the instructions in seemingly random order.
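The difference between the two issue policies can be illustrated with a toy simulation. The `run` function and the instruction tuples are hypothetical; the sketch assumes a single-issue machine, destinations that are already unique (as if renamed), and operands that are produced only by earlier instructions.

```python
def run(program, out_of_order):
    """Return (cycle, name) pairs in the order instructions are issued.

    program: list of (name, dest, srcs, latency) tuples in program order.
    """
    # Registers written by the program are unavailable until their producer issues.
    ready_at = {dest: float("inf") for _, dest, _, _ in program}
    waiting = list(program)
    issued = []
    cycle = 0
    while waiting:
        # An in-order machine may only consider the oldest waiting instruction.
        candidates = waiting if out_of_order else waiting[:1]
        for instr in list(candidates):
            name, dest, srcs, latency = instr
            # Registers not written by the program count as ready from cycle 0.
            if all(ready_at.get(s, 0) <= cycle for s in srcs):
                ready_at[dest] = cycle + latency
                issued.append((cycle, name))
                waiting.remove(instr)
                break  # at most one instruction issues per cycle
        cycle += 1
    return issued


PROGRAM = [
    ("load", "r1", (), 3),       # slow: result ready three cycles after issue
    ("add", "r2", ("r1",), 1),   # depends on the load
    ("mul", "r3", ("r4",), 1),   # independent of the load
    ("sub", "r5", ("r3",), 1),   # depends on the mul
]
```

With this program, the in-order policy idles while `add` waits for the load, whereas the out-of-order policy issues `mul` and `sub` during those cycles.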
The benefit of OoOE processing grows as the instruction pipeline deepens and the speed difference between main memory (or cache memory) and the processor widens. On modern machines, the processor runs many times faster than the memory, so during the time an in-order processor spends waiting for data to arrive, it could have processed a large number of instructions.
Dispatch and issue decoupling allows out-of-order issue
One of the differences created by the new paradigm is the creation of queues that allow the dispatch step to be decoupled from the issue step and the graduation stage to be decoupled from the execute stage. An early name for the paradigm was ''decoupled architecture''. In the earlier ''in-order'' processors, these stages operated in a fairly lock-step, pipelined fashion.
The instructions of the program need not be run in the originally specified order, as long as the end result is correct. A decoupled architecture separates the fetch and decode stages from the execute stage in a pipelined processor by using a buffer.
The buffer's purpose is to partition the memory access and execute functions in a computer program and achieve high performance by exploiting the fine-grain parallelism between the two. In doing so, it can hide much of the memory latency from the processor's perspective.
A larger buffer can, in theory, increase throughput. However, if the processor has a branch misprediction then the entire buffer may need to be flushed, wasting a lot of clock cycles and reducing the effectiveness. Furthermore, larger buffers create more heat and use more die space. For this reason processor designers today favour a multi-threaded design approach.
Decoupled architectures are generally thought of as not useful for general-purpose computing as they do not handle control-intensive code well. Control-intensive code includes such things as nested branches that occur frequently in operating system kernels. Decoupled architectures play an important role in scheduling in very long instruction word (VLIW) architectures.
To avoid false operand dependencies, which would reduce how often instructions can be issued out of order, a technique called register renaming is used. In this scheme, there are more physical registers than defined by the architecture. The physical registers are tagged so that multiple versions of the same architectural register can exist at the same time.
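The renaming scheme can be sketched as a mapping from architectural to physical registers. The `rename` function below is a hypothetical illustration, not any particular processor's mechanism; it assumes an unlimited supply of physical registers and never frees them.

```python
def rename(program, num_arch_regs):
    """Rename destinations so every write gets a fresh physical register.

    program: list of (dest, srcs) tuples using architectural register numbers.
    Returns the same program expressed in physical register numbers.
    """
    # Initially architectural register i lives in physical register i.
    mapping = {r: r for r in range(num_arch_regs)}
    next_free = num_arch_regs
    renamed = []
    for dest, srcs in program:
        # Sources read the newest version of each architectural register.
        phys_srcs = tuple(mapping[s] for s in srcs)
        # The destination gets a new physical register, removing WAW/WAR conflicts.
        mapping[dest] = next_free
        renamed.append((next_free, phys_srcs))
        next_free += 1
    return renamed
```

For a program such as `[(1, (2, 3)), (2, (1,)), (1, (4,))]`, the second instruction still reads the result of the first (a true dependency), but the later writes to registers 1 and 2 no longer collide with earlier uses of those names.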
Execute and writeback decoupling allows program restart
The queue for results is necessary to resolve issues such as branch mispredictions and exceptions/traps. The results queue allows programs to be restarted after an exception, which requires the instructions to be completed in program order. The queue allows results to be discarded due to mispredictions on older branch instructions and exceptions taken on older instructions.
The ability to issue instructions past branches that are yet to resolve is known as speculative execution.
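A results queue of this kind can be sketched as follows. The `ReorderBuffer` class and its method names are hypothetical; the sketch only shows that results may complete in any order, are written to the register file strictly in program order, and can be discarded when an older instruction mispredicts or traps.

```python
from collections import deque


class ReorderBuffer:
    """Toy results queue: results complete in any order, retire in program order."""

    def __init__(self):
        self.entries = deque()  # oldest instruction at the left
        self.regfile = {}

    def dispatch(self, tag, dest):
        # Allocate an entry in program order.
        self.entries.append({"tag": tag, "dest": dest, "value": None, "done": False})

    def complete(self, tag, value):
        # Record a result; this may happen out of program order.
        for entry in self.entries:
            if entry["tag"] == tag:
                entry["value"], entry["done"] = value, True

    def retire(self):
        """Write back finished results from the head; stop at the first unfinished one."""
        retired = []
        while self.entries and self.entries[0]["done"]:
            entry = self.entries.popleft()
            self.regfile[entry["dest"]] = entry["value"]
            retired.append(entry["tag"])
        return retired

    def flush_after(self, tag):
        """Discard every entry younger than `tag` (assumes `tag` is still queued)."""
        while self.entries and self.entries[-1]["tag"] != tag:
            self.entries.pop()
```

Because a discarded entry never reaches `regfile`, a speculatively computed result leaves no trace in the architectural state.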
Micro-architectural choices
* Are the instructions dispatched to a centralized queue or to multiple distributed queues?
:IBM PowerPC processors use queues that are distributed among the different functional units while other out-of-order processors use a centralized queue. IBM uses the term ''reservation stations'' for their distributed queues.
* Is there an actual results queue or are the results written directly into a register file? For the latter, the queueing function is handled by register maps that hold the register renaming information for each instruction in flight.
:Early Intel out-of-order processors use a results queue called a re-order buffer, while most later out-of-order processors use register maps.
:More precisely: Intel P6 family microprocessors have both a re-order buffer (ROB) and a register alias table (RAT). The ROB was motivated mainly by branch misprediction recovery.
:The Intel P6 family is among the earliest OoOE microprocessors but was supplanted by the NetBurst architecture. Years later, NetBurst proved to be a dead end due to its long pipeline that assumed the possibility of much higher operating frequencies. Materials were not able to match the design's ambitious clock targets due to thermal issues, and later designs based on NetBurst, namely Tejas and Jayhawk, were cancelled. Intel reverted to the P6 design as the basis of the Core and Nehalem microarchitectures. The succeeding Sandy Bridge, Ivy Bridge, and Haswell microarchitectures are a departure from the reordering techniques used in P6 and employ re-ordering techniques from the EV6 and the P4 but with a somewhat shorter pipeline.
See also
* Dataflow architecture
* Memory fence
* Replay system
* Scoreboarding
* Shelving buffer
* Tomasulo algorithm