In computer engineeringout-of-order execution (OoOE or OOE) is a paradigm used in most high-performance microprocessors to make use of instruction cycles that would otherwise be wasted by a certain type of costly delay. In this paradigm, a processor executes instructions in an order governed by the availability of input data, rather than by their original order in a program. In doing so, the processor can avoid being idle while data is retrieved for the next instruction in a program, processing instead the next instructions which is able to run immediately.


Out-of-order execution is a restricted form of data flow computation, which was a major research area in computer architecture in the 1970s and early 1980s. Important academic research in this subject was led by Yale Patt and his HPSm simulator. A paper by James E. Smith and A.R. Pleszkun, published in 1985 completed the scheme by describing how the precise behavior of exceptions could be maintained in out-of-order machines.

Arguably the first machine to use out-of-order execution was probably the CDC 6600 (1964), which used a scoreboard to resolve conflicts. In modern usage, such scoreboarding is considered to be in-order execution, not out-of-order execution, since such machines stall on the first RAW (Read After Write) conflict. Strictly speaking, such machines initiate execution in-order, although they may complete execution out-of-order.

About three years later, the IBM 360/91 (1966) introduced Tomasulo’s algorithm, supporting full out-of-order execution.

In 1990, IBM introduced the first out-of-order microprocessor, the POWER1, although out-of-order execution was limited to floating point instructions only.

Throughout the 1990s out-of-order execution became more common, and was featured in the IBM/Motorola PowerPC 601 (1993), Fujitsu/HAL SPARC64 (1995), Intel Pentium Pro (1995), MIPS R10000 (1996), HP PA-8000 (1996), AMD K5 (1996) and DEC Alpha 21264 (1998). Notable exceptions to this trend include the Sun UltraSPARCHP/Intel ItaniumTransmeta CrusoeIntel Atom, and the IBM POWER6.

The logical complexity of the out-of-order schemes was the reason that this technique did not reach mainstream machines until the mid-1990s. Many low-end processors meant for cost-sensitive markets still do not use this paradigm due to large silicon area that is required to build this class of machine. Low power usage is another design goal that’s harder to achieve with an OoOE design.

[edit]Basic concept

[edit]In-order processors

In earlier processors, the processing of instructions is normally done in these steps:

  1. Instruction fetch.
  2. If input operands are available (in registers for instance), the instruction is dispatched to the appropriate functional unit. If one or more operand is unavailable during the current clock cycle (generally because they are being fetched from memory), the processor stalls until they are available.
  3. The instruction is executed by the appropriate functional unit.
  4. The functional unit writes the results back to the register file.

[edit]Out-of-order processors

This new paradigm breaks up the processing of instructions into these steps:

  1. Instruction fetch.
  2. Instruction dispatch to an instruction queue (also called instruction buffer or reservation stations).
  3. The instruction waits in the queue until its input operands are available. The instruction is then allowed to leave the queue before earlier, older instructions.
  4. The instruction is issued to the appropriate functional unit and executed by that unit.
  5. The results are queued.
  6. Only after all older instructions have their results written back to the register file, then this result is written back to the register file. This is called the graduation or retire stage.

The key concept of OoO processing is to allow the processor to avoid a class of stalls that occur when the data needed to perform an operation are unavailable. In the outline above, the OoO processor avoids the stall that occurs in step (2) of the in-order processor when the instruction is not completely ready to be processed due to missing data.

OoO processors fill these “slots” in time with other instructions that are ready, then re-order the results at the end to make it appear that the instructions were processed as normal. The way the instructions are ordered in the original computer code is known as program order, in the processor they are handled in data order, the order in which the data, operands, become available in the processor’s registers. Fairly complex circuitry is needed to convert from one ordering to the other and maintain a logical ordering of the output; the processor itself runs the instructions in seemingly random order.

The benefit of OoO processing grows as the instruction pipeline deepens and the speed difference between main memory (or cache memory) and the processor widens. On modern machines, the processor runs many times faster than the memory, so during the time an in-order processor spends waiting for data to arrive, it could have processed a large number of instructions.

Out of Order Execution

In a standard superscalar CPU it is the programmer’s (or compiler’s) responsibility to schedule (arrange) the instructions to avoid hazards and pipeline stalls. Fancier CPUs can actually remove some of this burden and improve performance by automatically rescheduling instructions while the program executes. To understand how this is possible, consider the following instruction sequence:

           mov( SomeVar, ebx );

           mov( [ebx], eax );

           mov( 2000, ecx );

A data hazard exists between the first and second instructions above. The second instruction must delay until the first instruction completes execution. This introduces a pipeline stall and increases the running time of the program. Typically, the stall affects every instruction that follows. However, note that the third instruction’s execution does not depend on the result from either of the first two instructions. Therefore, there is no reason to stall the execution of the “mov( 2000, ecx );” instruction. It may continue executing while the second instruction waits for the first to complete. This technique, appearing in later members of the Pentium line, is called “out of order execution” because the CPU completes the execution of some instruction prior to the execution of previous instructions appearing in the code stream.

Clearly, the CPU may only execute instruction out of sequence if doing so produces exactly the same results as in-order execution. While there a lots of little technical issues that make this problem a little more difficult than it seems, with enough engineering effort it is quite possible to implement this feature.

Although you might think that this extra effort is not worth it (why not make it the programmer’s or compiler’s responsibility to schedule the instructions) there are some situations where out of order execution will improve performance that static scheduling could not handle.


Microarchitecture Pipeline stages Misc
AMD K5 Out-of-order execution, register renaming, speculative execution
AMD K6 Superscalar, branch prediction
AMD K6-III Branch prediction, speculative execution, out-of-order execution[10]
AMD K7 Out-of-order execution, branch prediction, Harvard architecture
AMD K8 64-bit, integrated memory controller, 16 byte instruction prefetching
AMD K10 Superscalar, out-of-order execution, 32-way set associative L3 victim cache, 32-byte instruction prefetching
ARM810 5
ARM1020E 6
XScale PXA210/PXA250 7
ARM1136J(F)-S 8
ARM1156T2(F)-S 9
ARM Cortex-A5 8
ARM Cortex-A8 13
ARM Cortex-A9 Out-of-order, speculative issue, superscalar
ARM Cortex-A15 Multicore (up to 16)
AVR32 AP7 7
AVR32 UC3 3 Harvard architecture
Bobcat Out-of-order execution
Bulldozer Shared L3 cache, multithreading, multicore, integrated memory controller
Crusoe In-order execution, 128-bit VLIW, integrated memory controller
Efficeon In-order execution, 256-bit VLIW, fully integrated memory controller
Cyrix Cx5x86 6[11] Branch prediction
Cyrix 6×86 Superscalar, superpipelined, register renaming, speculative execution, out-of-order execution
EV4 (Alpha 21064) Superscalar
EV7 (Alpha 21364) Superscalar design with out-of-order execution, branch prediction, 4-way SMT, integrated memory controller
EV8 (Alpha 21464) Superscalar design with out-of-order execution
P5 (Pentium) 5 Superscalar
P6 (Pentium Pro) 14 Speculative execution, Register renaming, superscalar design with out-of-order execution
P6 (Pentium II) Branch prediction
P6 (Pentium III) 10
Itanium Speculative execution, branch prediction, register renaming, 30 execution units, multithreading
NetBurst (Willamette) 20 Simultaneous multithreading
NetBurst (Northwood) 20 Simultaneous multithreading
NetBurst (Prescott) 31 Simultaneous multithreading
NetBurst (Cedar Mill) 31 Simultaneous multithreading
Core 14
Intel Atom 16 Simultaneous multithreading, in-order. No instruction reordering, speculative execution, or register renaming.
Nehalem Simultaneous multithreading, integrated memory controller, L1/L2/L3 cache
Sandy Bridge Simultaneous multithreading, multicore, integrated memory controller, L1/L2/L3 cache. 2 threads per core.
Haswell Multicore
LatticeMico32 6 Harvard architecture
POWER1 Supescalar, out-of-order execution
POWER3 Supescalar, out-of-order execution
POWER4 Supescalar, speculative execution, out-of-order execution
POWER5 Simultaneous multithreading, out-of-order execution, integrated memory controller
POWER6 2-way simultaneous multithreading, in-order execution
POWER7 4 SMT threads per core, 12 execution units per core
401PowerPC 401 3
PowerPC 405 5
PowerPC 440 7
PowerPC 470 9 SMP
PowerPC A2 15
PowerPC e300 4 Superscalar, Branch prediction
PowerPC e500 Dual 7 stage Multicore
PowerPC e600 3-issue 7 stage Superscalar out-of-order execution, branch prediction
PowerPC e5500 4-issue 7 stage Out-of-order, multicore
PowerPC 603 4 5 execution units, branch prediction. No SMP.
PowerPC 603q 5 In-order
PowerPC 604 6 Superscalar, out-of-order execution, 6 execution units. SMP support.
PowerPC 620 5 Out-of-order execution- SMP support.
PWRficient Superscalar, out-of-order execution, 6 execution units
R4000 8 Scalar
StrongARM SA-110 5 Scalar, in-order
SuperH SH2 5
SuperH SH2A 5 Superscalar, Harvard architecture
UltraSPARC 9
UltraSPARC T1 6 Open source, multithreading, multi-core, 4 threads per core, integrated memory controller
UltraSPARC T2 8 Open source, multithreading, multi-core, 8 threads per core
SPARC T3 Multithreading, multi-core, 8 threads per core, SMP
VIA C7 In-order execution
VIA Nano (Isaiah) Superscalar out-of-order execution, branch prediction, 7 execution units
WinChip 4 In-order execution