CA: Out-of-Order Execution and Tomasulo's Algorithm
To achieve high performance, modern processors must keep their execution units as busy as possible. A simple in-order pipeline stalls whenever an instruction needs data that isn't ready, wasting precious clock cycles. Out-of-order execution is a fundamental technique that allows a processor to bypass these stalls by executing independent later instructions first, thereby improving overall throughput and lowering the average Cycles Per Instruction (CPI). Implementing this safely and efficiently at the hardware level is the challenge brilliantly addressed by Tomasulo's algorithm.
The Motivation: Exploiting Instruction-Level Parallelism
The primary goal is to exploit Instruction-Level Parallelism (ILP)—the potential for executing multiple instructions simultaneously. Data hazards are the main obstacle. A Read-After-Write (RAW) hazard, a true dependency, forces an instruction to wait for its operands. However, Write-After-Read (WAR) and Write-After-Write (WAW) hazards are false dependencies caused by reuse of the same architectural register names, not actual dataflow. A naive out-of-order scheme would violate these dependencies and produce incorrect results. Tomasulo's algorithm solves this by dynamically renaming registers and using a distributed, reservation-station-based control system to track true dependencies, enabling instructions to execute as soon as their operands are available, regardless of program order.
Core Components: Reservation Stations and the Common Data Bus
Tomasulo's algorithm introduces two key structures that separate it from a centralized scoreboarding approach.
First, reservation stations are buffers attached to each functional unit (e.g., adder, multiplier). When an instruction is issued (dispatched), it waits in a reservation station, not in a central queue. Each station holds the operation and its source operands. Critically, an operand can be in one of two states: if the value is already available, it's stored directly; if it's not ready, the station stores a tag—a label identifying which future result will produce the value. This tag-based tracking is the heart of dynamic dependency management.
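The state of one such buffer can be sketched as a small record. The field names below (Vj/Vk for operand values, Qj/Qk for pending tags) follow the common textbook presentation and are illustrative, not taken from the text above:

```python
# A minimal sketch of one reservation-station entry. Vj/Vk hold operand
# values once available; Qj/Qk hold the tag of the station that will
# produce a still-pending operand.
from dataclasses import dataclass
from typing import Optional

@dataclass
class ReservationStation:
    name: str                    # this station's own tag, e.g. "Add1"
    busy: bool = False
    op: Optional[str] = None     # operation to perform, e.g. "ADD.D"
    vj: Optional[float] = None   # first operand value, if available
    vk: Optional[float] = None   # second operand value, if available
    qj: Optional[str] = None     # tag producing vj, if still pending
    qk: Optional[str] = None     # tag producing vk, if still pending

    def ready(self) -> bool:
        # Execution may begin only when neither operand is waiting on a tag.
        return self.busy and self.qj is None and self.qk is None
```

An entry holding one real operand and one tag is not ready; it becomes ready the moment the pending tag's value is captured.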
Second, results are broadcast to all waiting units via a Common Data Bus (CDB). When a functional unit finishes execution, it puts the result and its source tag onto the CDB. Every reservation station and the register file monitors the CDB. Any station waiting for that tag immediately captures the value, satisfying its dependency. This broadcast mechanism allows results to bypass the architectural register file and flow directly to the units that need them, enabling forwarding on a massive scale.
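One possible sketch of the broadcast logic, assuming stations are plain dicts with Qj/Qk tag fields, and a register-status table that maps each architectural register to the tag of its most recent pending producer (all names here are illustrative):

```python
# A sketch of one CDB broadcast cycle. Every station compares the broadcast
# tag against its pending Qj/Qk fields and captures the value on a match;
# the register file is just another listener on the bus.
def cdb_broadcast(stations, regfile, reg_status, tag, value):
    for rs in stations:
        if rs.get("Qj") == tag:               # waiting on this result?
            rs["Vj"], rs["Qj"] = value, None  # capture it, clear the tag
        if rs.get("Qk") == tag:
            rs["Vk"], rs["Qk"] = value, None
    # A register is updated only if this tag is still its most recent
    # pending producer; otherwise a newer instruction owns the register.
    for reg in [r for r, t in reg_status.items() if t == tag]:
        regfile[reg] = value
        del reg_status[reg]
```

Note that the value flows to waiting stations and the register file in the same cycle, which is the large-scale forwarding described above.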
The Three-Stage Instruction Flow
Instructions progress through three distinct stages: Issue, Execute, and Write-Back.
1. Issue (Dispatch): The processor fetches an instruction from the instruction queue and checks for a free reservation station at the required functional unit. If one is available, the instruction is issued to it, even if its operands aren't ready. This step performs register renaming: the algorithm reads the architectural registers; if a register's value is currently being computed (indicated by a tag in a Register Alias Table), that tag is stored in the reservation station as the source, and if the value is present, the value itself is stored. This renaming eliminates WAR and WAW hazards by ensuring each result writes to a unique tag (implicitly associated with the producing reservation station).
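The operand read with renaming can be sketched as follows; `reg_status` is a stand-in for the Register Alias Table, and all function and tag names are illustrative assumptions:

```python
# A sketch of the issue step's operand read with renaming. reg_status maps
# an architectural register to the tag of the station that will produce its
# next value; regfile holds values that are already available.
def read_operand(reg, regfile, reg_status):
    if reg in reg_status:              # value still being computed:
        return None, reg_status[reg]   # store the producer's tag
    return regfile[reg], None          # value available: store it directly

def issue(dest, src1, src2, station_tag, regfile, reg_status):
    v1, q1 = read_operand(src1, regfile, reg_status)
    v2, q2 = read_operand(src2, regfile, reg_status)
    # Renaming: the destination now maps to this station's tag, so any
    # later reader of `dest` waits on the tag, not on the stale register.
    reg_status[dest] = station_tag
    return {"Vj": v1, "Qj": q1, "Vk": v2, "Qk": q2}
```

Issuing a second instruction that reads the first one's destination would pick up the first station's tag rather than the old register value, which is exactly how WAR and WAW hazards disappear.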
2. Execute: The instruction remains in its reservation station until all of its source operands are real values (i.e., no longer pending tags). Once operands are ready, the functional unit begins execution. If multiple instructions in the same unit's stations become ready, they may execute out of program order. A crucial note: for loads and stores, the address calculation must also wait for its operands, and a store must additionally wait for its data. Memory accesses typically execute in program order relative to other memory operations to maintain correctness.
3. Write-Back: When execution finishes, the result, along with its unique tag, is placed on the Common Data Bus (CDB) and broadcast. It is captured by: (1) all reservation stations waiting for this tag, (2) the register file, if this is the most recent pending update (in program order) to that register, and (3) any store instructions waiting for this data. This completes the instruction's execution.
Register Renaming and Elimination of False Dependencies
The magic of hazard elimination lies in the combination of reservation stations and the CDB. Consider this sequence:
MUL.D F2, F0, F4
ADD.D F4, F2, F8
SUB.D F6, F4, F2

Here, a WAR hazard exists between the MUL.D (which reads F4) and the ADD.D (which writes F4), and a WAW hazard would arise if the SUB.D were another write to F4. In Tomasulo's algorithm:

- The MUL.D issues to a multiplier reservation station (say, M1). Its destination register F2 is now associated with tag M1.
- The ADD.D issues to an adder station (A1). It needs F2, so it stores tag M1 as a source. It will write to F4, so the register alias table now points F4 to tag A1.
- The SUB.D issues to another adder station (A2). It needs F4, and the current tag for F4 is A1, so A1 is stored as a source, not the stale value in the F4 register file entry. The SUB.D's result is destined for F6, a different register, so no conflict arises.

The ADD.D and SUB.D are both waiting on tags (M1 and A1, respectively). The SUB.D's read of F4 is a true RAW dependency on the ADD.D, preserved by the A1 tag; meanwhile, the MUL.D already captured its F4 operand at issue, so the ADD.D's write to F4 causes no WAR stall. When M1's result broadcasts, the ADD.D can execute, and later its result on tag A1 will release the SUB.D. True dependencies (RAW) are preserved via tag matching, while the reuse of the name F4 never stalls the pipeline.
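The final state of the register alias table after all three issues can be written out explicitly; the step-by-step comments below simply restate the walkthrough above:

```python
# Tracing the renaming state after the three issues in the example.
# M1/A1/A2 are the (illustrative) reservation-station tags used above.
reg_status = {}
reg_status["F2"] = "M1"   # MUL.D F2, F0, F4 issues to M1
reg_status["F4"] = "A1"   # ADD.D F4, F2, F8 issues to A1 (reads F2 as tag M1)
reg_status["F6"] = "A2"   # SUB.D F6, F4, F2 issues to A2 (reads F4 as tag A1)

# Each architectural destination maps to a unique producing station,
# so the three writes can never collide.
assert reg_status == {"F2": "M1", "F4": "A1", "F6": "A2"}
```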
Analyzing Performance Impact on CPI
The primary performance benefit is a reduction in the average CPI. In a perfect in-order pipeline, CPI is ideally 1, but stalls for data hazards increase it. Out-of-order execution with Tomasulo's algorithm reduces the stall cycles caused by latency.
For example, a long-latency operation like a floating-point multiply or a cache-missing load no longer blocks the entire pipeline. Independent integer instructions, branches, or other FP operations can continue issuing, executing, and completing during that latency. The throughput of the processor increases because functional unit utilization improves. The effective CPI moves closer to 1 (or even lower with superscalar issue), limited primarily by the true data dependencies in the instruction stream and the available hardware resources (number of reservation stations, CDB bandwidth, functional units).
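A toy calculation makes the effect concrete. The numbers below (base CPI of 1.0, 20% long-latency operations, 6 stall cycles each, and the fraction of those stalls hidden by overlapping independent work) are assumptions chosen purely for illustration:

```python
# Effective CPI = base CPI + (fraction of long-latency ops)
#                 * (stall cycles per such op) * (fraction NOT hidden).
def effective_cpi(base, long_frac, stall_cycles, hidden_frac):
    return base + long_frac * stall_cycles * (1.0 - hidden_frac)

# In-order: every stall cycle is exposed.
in_order = effective_cpi(1.0, 0.2, 6, 0.0)    # ~2.2
# Out-of-order: suppose 75% of stall cycles are overlapped with other work.
ooo = effective_cpi(1.0, 0.2, 6, 0.75)        # ~1.3
```

Even with most (but not all) latency hidden, the effective CPI drops substantially; the residual gap above 1.0 reflects the true dependencies and resource limits mentioned above.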
Common Pitfalls
Misunderstanding the Write-Back Stage: It's easy to think the result is written directly to the register file at write-back. Remember, the CDB broadcast is the write-back; the register file is just one of many subscribers. If an older instruction (in program order) is still calculating a value for the same architectural register, a newer instruction's result might reach the register file first. By tracking tags, the algorithm ensures only the result that is last in program order updates the register file, preserving correctness.
Confusing Issue with Execution Start: A key insight is that an instruction leaves the issue stage and enters a reservation station before its operands are ready. Execution does not begin until later, when operands become available. This decoupling of issue from operand readiness is what allows the processor to look deep into the instruction window and find independent work.
Overlooking CDB as a Bottleneck: The Common Data Bus (CDB) is a single, shared resource. In a given cycle, only one result can be broadcast. If multiple functional units finish simultaneously, they must arbitrate for the CDB, causing winners to broadcast and losers to stall. This serialization can become a performance bottleneck in designs with many parallel units, a limitation addressed in modern designs with multiple result buses.
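The arbitration step can be sketched as a simple selection policy; picking the oldest finisher by issue order is one plausible choice, not the only one:

```python
# A sketch of single-CDB arbitration: when several functional units finish
# in the same cycle, only one result broadcasts; the rest must hold their
# results and retry in a later cycle.
def arbitrate(finished):
    # `finished` is a list of (issue_order, tag, value) tuples.
    if not finished:
        return None, []
    winner = min(finished, key=lambda r: r[0])   # oldest-first policy
    losers = [r for r in finished if r is not winner]
    return winner, losers
```

With many functional units, the losers' queue grows whenever completions cluster, which is precisely the serialization bottleneck that multiple result buses relieve.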
Assuming Memory Operations Execute Fully Out-of-Order: While address calculation for loads can proceed out-of-order, the actual memory access and the order of stores are typically constrained to maintain coherent memory semantics. Most Tomasulo implementations enforce that memory instructions commit in program order.
Summary
- Tomasulo's algorithm enables out-of-order execution by using reservation stations to buffer operations and a Common Data Bus (CDB) to broadcast results, dynamically resolving data dependencies.
- It eliminates false dependencies (WAR/WAW) through implicit register renaming, where destination registers are mapped to unique tags associated with the producing reservation station.
- Instructions progress through Issue, Execute, and Write-Back stages, executing in the Execute stage only when all operand values are ready, which may be in a different order than they were issued.
- The algorithm improves processor throughput and reduces average CPI by hiding the latency of long operations, allowing independent instructions to proceed and keeping functional units busy.
- Performance can be limited by the number of reservation stations (instruction window size) and contention for the CDB, which is a critical shared resource.