In computer engineering, out-of-order execution (OoOE or OOE) is a paradigm used in most high-performance microprocessors to make use of instruction cycles that would otherwise be wasted by a certain type of costly delay. In this paradigm, a processor executes instructions in an order governed by the availability of input data, rather than by their original order in a program. In doing so, the processor can avoid being idle while data is retrieved for the next instruction in a program, instead processing subsequent instructions that are able to run immediately.
Out-of-order execution is a restricted form of data flow computation, which was a major research area in computer architecture in the 1970s and early 1980s. Important academic research in this subject was led by Yale Patt with his HPSm simulator. A paper by James E. Smith and A. R. Pleszkun, published in 1985, completed the scheme by describing how the precise behavior of exceptions could be maintained in out-of-order machines.
The first machine to use out-of-order execution was arguably the CDC 6600 (1964), which used a scoreboard to resolve conflicts. In modern usage, such scoreboarding is considered to be in-order execution, not out-of-order execution, since such machines stall on the first RAW (read-after-write) conflict. Strictly speaking, such machines initiate execution in order, although they may complete execution out of order.
About three years later, the IBM 360/91 (1966) introduced Tomasulo’s algorithm, supporting full out-of-order execution.
In 1990, IBM introduced the first out-of-order microprocessor, the POWER1, although its out-of-order execution was limited to floating-point instructions.
Throughout the 1990s out-of-order execution became more common, and was featured in the IBM/Motorola PowerPC 601 (1993), Fujitsu/HAL SPARC64 (1995), Intel Pentium Pro (1995), MIPS R10000 (1996), HP PA-8000 (1996), AMD K5 (1996) and DEC Alpha 21264 (1998). Notable exceptions to this trend include the Sun UltraSPARC, HP/Intel Itanium, Transmeta Crusoe, Intel Atom, and the IBM POWER6.
The logical complexity of out-of-order schemes is the reason this technique did not reach mainstream machines until the mid-1990s. Many low-end processors meant for cost-sensitive markets still do not use this paradigm because of the large silicon area required to build this class of machine. Low power usage is another design goal that is harder to achieve with an OoOE design.
In earlier processors, the processing of instructions is normally done in these steps:
- Instruction fetch.
- If input operands are available (in registers, for instance), the instruction is dispatched to the appropriate functional unit. If one or more operands are unavailable during the current clock cycle (generally because they are being fetched from memory), the processor stalls until they are available.
- The instruction is executed by the appropriate functional unit.
- The functional unit writes the results back to the register file.
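The in-order steps above can be sketched as a toy simulator. This is a hypothetical model, not a real pipeline: each instruction is a `(dest, sources, latency)` tuple, and the stall of step 2 shows up as waiting for a source register's value to become available.

```python
# Toy in-order pipeline model.  Each instruction is (dest, sources,
# latency): it dispatches only once every source register is available
# (step 2), and its result reaches the register file 'latency' cycles
# after dispatch (step 4).

def run_in_order(program):
    ready_at = {}   # register name -> cycle its value is available
    cycle = 0
    for dest, sources, latency in program:
        # Stall: wait for every input operand (step 2).
        cycle = max([cycle] + [ready_at.get(r, 0) for r in sources])
        cycle += 1                             # dispatch + execute
        ready_at[dest] = cycle + latency - 1   # write-back (step 4)
    return max(ready_at.values())

program = [
    ("r1", [], 3),        # long-latency load into r1
    ("r2", ["r1"], 1),    # needs r1: stalls until r1 is ready
    ("r3", [], 1),        # independent, but still waits behind r2
]
print(run_in_order(program))   # -> 5
```

Note that the third, independent instruction is delayed anyway, simply because it sits behind the stalled instruction in program order.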
This new paradigm breaks up the processing of instructions into these steps:
- Instruction fetch.
- Instruction dispatch to an instruction queue (also called instruction buffer or reservation stations).
- The instruction waits in the queue until its input operands are available. The instruction can then leave the queue, possibly before earlier, older instructions.
- The instruction is issued to the appropriate functional unit and executed by that unit.
- The results are queued.
- The result is written back to the register file only after all older instructions have had their results written back. This is called the graduation or retire stage.
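The out-of-order steps can be sketched the same way. In this toy model (assuming, for simplicity, that one instruction issues per cycle), instructions wait in a queue, issue as soon as their operands are ready, even ahead of older instructions, and retire in program order.

```python
# Toy out-of-order core: instructions sit in a queue (the reservation
# stations), issue as soon as their operands are ready, and retire in
# program order.  At most one instruction issues per cycle.

def run_out_of_order(program):
    waiting = list(enumerate(program))   # (program-order index, instr)
    ready_at = {}                        # register -> cycle available
    done_at = {}                         # index -> completion cycle
    cycle = 0
    while waiting:
        cycle += 1
        # Issue the oldest instruction whose operands are available,
        # skipping over stalled older instructions.
        for i, (idx, (dest, sources, latency)) in enumerate(waiting):
            if all(ready_at.get(r, 0) <= cycle for r in sources):
                ready_at[dest] = cycle + latency
                done_at[idx] = cycle + latency
                del waiting[i]
                break
    # Retire in program order: a result is written back only after all
    # older instructions have written theirs (graduation stage).
    retire = 0
    for idx in sorted(done_at):
        retire = max(retire, done_at[idx])
    return retire

program = [
    ("r1", [], 3),        # long-latency load into r1
    ("r2", ["r1"], 1),    # waits in the queue for r1
    ("r3", [], 1),        # independent: issues while r2 waits
]
print(run_out_of_order(program))   # -> 5
```

The key difference from the in-order outline is visible in the issue loop: the independent instruction issues while its stalled predecessor is still waiting in the queue.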
The key concept of OoO processing is to allow the processor to avoid a class of stalls that occur when the data needed to perform an operation are unavailable. In the outline above, the OoO processor avoids the stall that occurs in step (2) of the in-order processor when the instruction is not completely ready to be processed due to missing data.
OoO processors fill these “slots” in time with other instructions that are ready, then re-order the results at the end to make it appear that the instructions were processed as normal. The order in which instructions appear in the original computer code is known as program order; in the processor they are handled in data order, the order in which the data (operands) become available in the processor’s registers. Fairly complex circuitry is needed to convert from one ordering to the other and maintain a logical ordering of the output; the processor itself runs the instructions in seemingly random order.
The benefit of OoO processing grows as the instruction pipeline deepens and the speed difference between main memory (or cache memory) and the processor widens. On modern machines, the processor runs many times faster than the memory, so during the time an in-order processor spends waiting for data to arrive, it could have processed a large number of instructions.
Out-of-Order Execution
In a standard superscalar CPU it is the programmer’s (or compiler’s) responsibility to schedule (arrange) the instructions to avoid hazards and pipeline stalls. Fancier CPUs can actually remove some of this burden and improve performance by automatically rescheduling instructions while the program executes. To understand how this is possible, consider the following instruction sequence:
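The instruction sequence itself is missing from this copy. A sequence consistent with the discussion that follows (written in the HLA syntax the excerpt uses; `SomeVar` is a placeholder name) would be:

```
mov( SomeVar, ebx );    // load a value into ebx
mov( [ebx], eax );      // uses ebx: depends on the previous instruction
mov( 2000, ecx );       // independent of the first two instructions
```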
A data hazard exists between the first and second instructions above. The second instruction must wait until the first instruction completes execution. This introduces a pipeline stall and increases the running time of the program. Typically, the stall affects every instruction that follows. However, note that the third instruction’s execution does not depend on the result from either of the first two instructions. Therefore, there is no reason to stall the execution of the “mov( 2000, ecx );” instruction; it may continue executing while the second instruction waits for the first to complete. This technique, appearing in later members of the Pentium line, is called “out of order execution” because the CPU completes the execution of some instructions prior to the execution of previous instructions appearing in the code stream.
Clearly, the CPU may only execute instructions out of sequence if doing so produces exactly the same results as in-order execution. While there are many small technical issues that make this problem more difficult than it seems, with enough engineering effort it is quite possible to implement this feature.
Although you might think this extra effort is not worth it (why not make it the programmer’s or compiler’s responsibility to schedule the instructions?), there are some situations where out-of-order execution improves performance in ways that static scheduling cannot.
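One such situation is a load whose latency depends on whether it hits the cache, something knowable only at run time. The sketch below is a deliberately simplified illustration (hypothetical latency numbers): a static, compile-time schedule must budget for the worst case on every load, while dynamic scheduling reacts to each load's actual latency.

```python
# A load's latency depends on whether it hits the cache -- information
# that exists only at run time.  Hypothetical latencies for illustration.
MISS_LATENCY, HIT_LATENCY = 10, 1

def static_schedule_time(n_loads):
    # The compiler cannot know hit/miss, so a safe static schedule
    # spaces dependent work assuming the worst case for every load.
    return n_loads * MISS_LATENCY

def dynamic_schedule_time(hits):
    # Out-of-order hardware observes each load's actual latency.
    return sum(HIT_LATENCY if h else MISS_LATENCY for h in hits)

hits = [True, True, False, True]         # 3 cache hits, 1 miss
print(static_schedule_time(len(hits)))   # -> 40
print(dynamic_schedule_time(hits))       # -> 13
```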
The following table lists microarchitectures together with their pipeline depth (in stages, where known) and notable features:

| Microarchitecture | Pipeline stages | Notable features |
|---|---|---|
| AMD K5 | | Out-of-order execution, register renaming, speculative execution |
| AMD K6 | | Superscalar, branch prediction |
| AMD K6-III | | Branch prediction, speculative execution, out-of-order execution |
| AMD K7 | | Out-of-order execution, branch prediction, Harvard architecture |
| AMD K8 | | 64-bit, integrated memory controller, 16-byte instruction prefetching |
| AMD K10 | | Superscalar, out-of-order execution, 32-way set-associative L3 victim cache, 32-byte instruction prefetching |
| ARM Cortex-A9 | | Out-of-order, speculative issue, superscalar |
| ARM Cortex-A15 | | Multicore (up to 16) |
| AVR32 UC3 | 3 | Harvard architecture |
| Bulldozer | | Shared L3 cache, multithreading, multicore, integrated memory controller |
| Crusoe | | In-order execution, 128-bit VLIW, integrated memory controller |
| Efficeon | | In-order execution, 256-bit VLIW, fully integrated memory controller |
| Cyrix Cx5x86 | 6 | Branch prediction |
| Cyrix 6×86 | | Superscalar, superpipelined, register renaming, speculative execution, out-of-order execution |
| EV4 (Alpha 21064) | | Superscalar |
| EV7 (Alpha 21364) | | Superscalar, out-of-order execution, branch prediction, 4-way SMT, integrated memory controller |
| EV8 (Alpha 21464) | | Superscalar, out-of-order execution |
| P6 (Pentium Pro) | 14 | Speculative execution, register renaming, superscalar, out-of-order execution |
| P6 (Pentium II) | | Branch prediction |
| P6 (Pentium III) | 10 | |
| Itanium | | Speculative execution, branch prediction, register renaming, 30 execution units, multithreading |
| NetBurst (Willamette) | 20 | Simultaneous multithreading |
| NetBurst (Northwood) | 20 | Simultaneous multithreading |
| NetBurst (Prescott) | 31 | Simultaneous multithreading |
| NetBurst (Cedar Mill) | 31 | Simultaneous multithreading |
| Intel Atom | 16 | Simultaneous multithreading, in-order; no instruction reordering, speculative execution, or register renaming |
| Nehalem | | Simultaneous multithreading, integrated memory controller, L1/L2/L3 cache |
| Sandy Bridge | | Simultaneous multithreading, multicore, integrated memory controller, L1/L2/L3 cache; 2 threads per core |
| POWER1 | | Superscalar, out-of-order execution |
| POWER3 | | Superscalar, out-of-order execution |
| POWER4 | | Superscalar, speculative execution, out-of-order execution |
| POWER5 | | Simultaneous multithreading, out-of-order execution, integrated memory controller |
| POWER6 | | 2-way simultaneous multithreading, in-order execution |
| POWER7 | | 4 SMT threads per core, 12 execution units per core |
| PowerPC e300 | 4 | Superscalar, branch prediction |
| PowerPC e500 | Dual 7-stage | Multicore |
| PowerPC e600 | 3-issue, 7-stage | Superscalar, out-of-order execution, branch prediction |
| PowerPC e5500 | 4-issue, 7-stage | Out-of-order, multicore |
| PowerPC 603 | 4 | 5 execution units, branch prediction; no SMP |
| PowerPC 604 | 6 | Superscalar, out-of-order execution, 6 execution units; SMP support |
| PowerPC 620 | 5 | Out-of-order execution; SMP support |
| PWRficient | | Superscalar, out-of-order execution, 6 execution units |
| StrongARM SA-110 | 5 | Scalar, in-order |
| SuperH SH2A | 5 | Superscalar, Harvard architecture |
| UltraSPARC T1 | 6 | Open source, multithreading, multicore, 4 threads per core, integrated memory controller |
| UltraSPARC T2 | 8 | Open source, multithreading, multicore, 8 threads per core |
| SPARC T3 | | Multithreading, multicore, 8 threads per core, SMP |
| VIA C7 | | In-order execution |
| VIA Nano (Isaiah) | | Superscalar, out-of-order execution, branch prediction, 7 execution units |