| Dynamic scheduling                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         | Out-of-order execution: basics                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |
|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| <ul> <li>Last time: data hazards that prevent instruction issue were hidden by: <ul> <li>Forwarding</li> <li>Static scheduling by the compiler</li> </ul> </li> <li>Dynamic scheduling is also possible: <ul> <li>The CPU rearranges the instructions (while preserving dependences) to reduce stalls</li> </ul> </li> <li>Dynamic scheduling has several advantages over static scheduling: <ul> <li>Handles dependences that are unknown at compile time, such as those involving memory references and branches</li> <li>Allows code compiled with one pipeline in mind to run efficiently on a different pipeline</li> </ul> </li> </ul> | <ul> <li>Until now, all techniques have required in-order instruction issue <ul> <li>A stalled instruction holds up those behind it</li> </ul> </li> <li>What if following instructions could "pass" the stalled one? <ul> <li>DIVD F0,F2,F4 ; long latency</li> <li>ADDD F10,F0,F8 ; stalled waiting for F0</li> <li>SUBD F12,F8,F14 ; could proceed with this one!</li> </ul> </li> <li>Out-of-order execution: allow instructions to issue in any order as long as dependences aren't violated <ul> <li>Execute SUBD before ADDD in the above example, reducing stalls</li> <li>Must handle out-of-order completion <ul> <li>May cause problems handling exceptions</li> <li>May not gain much if there are long dependence chains</li> </ul> </li> </ul></li></ul> |
| Implementing out-of-order execution                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        | Scoreboarding                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
|                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |



## More pipeline changes for scoreboarding

- Read operands (RD)
  - Reading the operands is delayed until they are available
    - ⇒ No previously issued but uncompleted instruction has the operand as its destination
  - RAW hazards resolved dynamically
- Execution (EX) stage changed
  - Notify the scoreboard when EX is completed
    - Allow a new instruction to use the functional unit
  - EX may take multiple cycles if necessary
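The read-operands condition above can be sketched in a few lines of Python. This is a minimal illustration, not the scoreboard's actual logic: the function name `can_read_operands` and the dictionary layout are assumptions made for the example, and the instructions are the DIVD/ADDD/SUBD sequence from the out-of-order basics slide.

```python
def can_read_operands(sources, register_result_status):
    """Return True when no pending instruction still has to write a source register."""
    return all(register_result_status.get(reg) is None for reg in sources)


# Register result status: register -> functional unit with a pending write.
# Registers without a pending write are simply absent (an assumption of this sketch).
status = {"F0": "Divide"}  # DIVD F0,F2,F4 has issued but not completed

addd_ready = can_read_operands(["F0", "F8"], status)   # ADDD F10,F0,F8
subd_ready = can_read_operands(["F8", "F14"], status)  # SUBD F12,F8,F14
print(addd_ready, subd_ready)
```

Here ADDD must wait in RD because F0 is still owned by the divide unit, while SUBD has no pending producer for either source and may read its operands.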

## Writeback (WB) with scoreboarding

- The scoreboard checks for WAR hazards and stalls the completing instruction if necessary
  - In the earlier example, SUBD would be stalled in WB until ADDD reads its operands
- Writeback is stalled if
  - A preceding instruction has not yet read its operands, and
  - One of those operands is the same register as the destination of the completing instruction
- The DLX pipeline is now six stages long
  - IF IS RD EX MEM WB
- Forwarding is not used here; the penalty is small because write-back occurs as soon as the result is available
- Instructions that do not need the MEM stage skip it
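The writeback stall condition can be sketched as follows. The `Pending` record and `writeback_must_stall` are hypothetical names introduced for this example, and the completing instruction that writes F8 is chosen only to create an antidependence with ADDD F10,F0,F8; treat it as an assumption rather than the slide's own example.

```python
from dataclasses import dataclass


@dataclass
class Pending:
    """An earlier instruction that has issued but may not have read its operands."""
    name: str
    sources: tuple
    has_read_operands: bool


def writeback_must_stall(dest, earlier):
    """Stall when an earlier instruction still has to read the old value of `dest`."""
    return any(dest in p.sources and not p.has_read_operands for p in earlier)


addd = Pending("ADDD F10,F0,F8", ("F0", "F8"), has_read_operands=False)

stall_f8 = writeback_must_stall("F8", [addd])    # writer of F8 must wait (WAR)
stall_f12 = writeback_must_stall("F12", [addd])  # F12 is not a source of ADDD

addd.has_read_operands = True                    # ADDD passes through RD
stall_after_read = writeback_must_stall("F8", [addd])
```

Once ADDD has read its operands, the write to F8 is free to proceed.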

UMBC CMSC 611 (Advanced Computer Architecture), Spring 2000, Chapter 4, 2-Mar-00

[Figure: scoreboard example with function unit status (Mult1, Mult2, Add, Divide) and register result status (F0 ... F30); the slide annotation reads "Distinguishes RAW from WAR"]



**Instruction status**

| Instruction      | Issue | Read operands | Exec complete | Write result |
|------------------|-------|---------------|---------------|--------------|
| LD F6, 40(R2)    | X     | X             | X             | X            |
| …, 52(R3)        | X     | X             | X             | X            |
| MULTD F0, F2, F4 | X     | X             | X             | X            |
|                  | X     | X             | X             | X            |
|                  | X     | X             | X             |              |
| ADDD F6, F8, F2  | X     | X             | X             | X            |

**Function unit status**

| Name    | Busy | Op   | $F_i$ | $F_j$ | $F_k$ | $Q_j$ | $Q_k$ | $R_j$ | $R_k$ |
|---------|------|------|-------|-------|-------|-------|-------|-------|-------|
| Integer | No   | -    | -     | -     | -     | -     | -     | -     | -     |
| Mult1   | Yes  | Mult | -     | -     | -     | -     | -     | -     | -     |
| Mult2   | No   | -    | -     | -     | -     | -     | -     | -     | -     |
| Add     | Yes  | Add  | -     | -     | -     | -     | -     | -     | -     |
| Divide  | Yes  | Div  | F10   | F0    | F6    | -     | -     | No    | No    |

**Register result status**

|          | F0 | F2 | F4 | F6 | F8 | F10 | F12 | ... | F30 |
|----------|----|----|----|----|----|-----|-----|-----|-----|
| FuncUnit |    |    |    |    |    | Div |     |     |     |

## Scoreboard limitations

- ILP: if there aren't any independent instructions to execute, scoreboarding and other dynamic techniques don't help much
- Size of the "issued" queue (the **window**)
  - Determines how far ahead the CPU can look for instructions
  - For now, assume that a window cannot span a branch
    - Window includes instructions only within basic blocks
    - The window can be extended beyond the branch: details later
- Number, types, and speed of the functional units
- Presence of antidependences and output dependences
  - WAR and WAW hazards limit the scoreboard more than RAW hazards do
  - RAW hazards are problems for any technique
  - WAR and WAW hazards can be solved using other mechanisms

## Handling hazards with a scoreboard

- RAW hazards
  - Detect RAW hazards by checking whether a source register is listed in the Register Result Status table
    - $\Rightarrow$ If it is, we have a RAW hazard
  - If the pending instruction is receiving a value from the current instruction, then set the corresponding $R_j$/$R_k$ field of the pending instruction to **No**
- WAR hazards
  - Before writing the value, check that no pending instruction is still using the previous value of the register to be modified
  - If some pending instruction has already "received" the value it needs but hasn't yet read it, then $R_j$/$R_k$ is set to **Yes** and any instruction writing the register must stall (WAR)
    - This is how we distinguish between a RAW and a WAR hazard
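The role of the $R_j$/$R_k$ flags can be made concrete with a small classifier. This sketch assumes the usual textbook bookkeeping, which the slides do not spell out: the Q field names the producing unit while a value is outstanding and is cleared once the value arrives, and the R flag is Yes only between the value becoming available and the operand being read. The names `Operand` and `operand_state` and the three state labels are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Operand:
    register: str
    producer: Optional[str]  # Q field: unit that will produce the value, or None
    ready_unread: bool       # R field: True means "Yes", False means "No"


def operand_state(op):
    """Classify one source operand of a pending instruction."""
    if op.ready_unread:
        # Value is in the register but not yet read: writers of it must stall (WAR).
        return "blocks-writers"
    if op.producer is not None:
        # Value has not been produced yet: this instruction waits (RAW).
        return "waits-for-producer"
    # R is No and nothing is outstanding: the operand was already read.
    return "already-read"


waiting = operand_state(Operand("F0", producer="Mult1", ready_unread=False))
unread = operand_state(Operand("F8", producer=None, ready_unread=True))
done = operand_state(Operand("F6", producer=None, ready_unread=False))
```

Only the middle case makes a later writer of the register stall; the first case stalls the reader itself.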


## Tomasulo's approach

- Tomasulo's approach is a technique to allow execution to proceed in the presence of hazards
  - First introduced in the IBM 360/91
  - Applied only to floating-point operations (including FP memory ops)
- Uses renaming to avoid WAW and WAR hazards
  - Compiler can rename registers (statically) to avoid WAW and WAR hazards
  - Tomasulo's scheme performs this function dynamically
    - Buffers operands of instructions waiting to issue, fetching them as soon as they are available, avoiding the register file
    - The register specifiers of instructions are renamed to reservation station numbers as they are issued, *eliminating* **WAW** and **WAR** hazards


## Scoreboarding vs. Tomasulo's approach

- Register renaming
  - Register renaming is used to eliminate WAR and WAW hazards
  - Scoreboarding must wait for WAR and WAW hazards to clear
- Distributed control
  - Hazard detection and execution control are distributed to each functional unit
  - Scoreboarding has a centralized control unit
- Common Data Bus
  - Used to forward results directly to the functional units without going through the register file
  - Scoreboarding connects each functional unit to the register file

## Tomasulo's approach: issue stage

- Take an instruction from the instruction queue
  - If there's a station available for it, send the instruction to the station
  - Otherwise, stall for a structural hazard
- Check whether each source operand will be produced by an instruction that is still in flight
  - Renaming is done by checking whether the desired register is being written by an instruction already at a reservation station
    - If the value is not being generated by a functional unit, it is fetched from the register file
    - If the value is being generated, the name of the reservation station generating the result is used instead
- If the operation is a load or a store, it can issue if there is an available load or store buffer
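The issue step can be sketched as below. This is a simplified model under stated assumptions: a single pool of stations rather than one pool per functional unit, a hypothetical station name `Div1`, and field names (`vj`, `vk`, `qj`, `qk`) borrowed from the common textbook notation rather than from these slides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Station:
    name: str
    busy: bool = False
    op: Optional[str] = None
    vj: Optional[float] = None  # operand values, once known
    vk: Optional[float] = None
    qj: Optional[str] = None    # stations that will produce missing operands
    qk: Optional[str] = None


def issue(op, dest, src_j, src_k, stations, regs, reg_status):
    """Issue one instruction, or return None on a structural hazard."""
    station = next((s for s in stations if not s.busy), None)
    if station is None:
        return None  # no free station: stall
    station.busy, station.op = True, op
    station.vj = station.vk = station.qj = station.qk = None
    for src, v_field, q_field in ((src_j, "vj", "qj"), (src_k, "vk", "qk")):
        producer = reg_status.get(src)
        if producer is None:
            setattr(station, v_field, regs[src])  # copy the value now
        else:
            setattr(station, q_field, producer)   # rename to the producing station
    reg_status[dest] = station.name               # later readers of dest wait on us
    return station


regs = {"F0": 0.0, "F6": 6.0}
reg_status = {"F0": "Mult1"}  # F0 is still being computed by Mult1
stations = [Station("Div1")]

div = issue("DIVD", "F10", "F0", "F6", stations, regs, reg_status)
regs["F6"] = 99.0             # a later instruction overwrites F6
second = issue("DIVD", "F12", "F0", "F6", stations, regs, reg_status)
```

Because the value of F6 was copied into the station at issue, the later write to F6 cannot disturb the DIVD, which is how renaming removes the WAR hazard. The second issue attempt returns `None` because no station is free.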

## Tomasulo's approach: design

- Reservation stations are the heart of Tomasulo's approach
  - Located at each functional unit (there may be more reservation stations than functional units)
  - Hold the operand values for each computation before it begins



## Tomasulo's approach: execute & WB

- Execute
  - If at least one operand is missing, monitor the CDB until it is generated
  - When a needed operand is put out on the CDB, it is placed into the appropriate reservation station
  - When both operands are ready, the operation is executed
  - $\Rightarrow$ RAW hazards are handled here
- Write result
  - When the result is ready, write it on the CDB, and from there into the register file and any waiting reservation stations
    - $\Rightarrow$ Only one value can be written on the CDB in any single cycle!
  - Indicate that the reservation station is no longer busy
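A CDB broadcast can be sketched as a single pass over the waiting stations and the register status. The dictionary layout and the names `broadcast`, `Add1`, and `Add2` are assumptions for this illustration; freeing the producing station is omitted for brevity.

```python
def broadcast(tag, value, stations, regs, reg_status):
    """Deliver one result, identified by its producing station's tag, to every listener."""
    for s in stations:
        if s["qj"] == tag:
            s["vj"], s["qj"] = value, None
        if s["qk"] == tag:
            s["vk"], s["qk"] = value, None
    for reg, producer in reg_status.items():
        if producer == tag:       # the register still expects this result
            regs[reg] = value
            reg_status[reg] = None


def ready(stations):
    """Stations whose operands are both present."""
    return [s["name"] for s in stations if s["qj"] is None and s["qk"] is None]


stations = [
    {"name": "Add1", "vj": 1.0, "vk": None, "qj": None, "qk": "Mult1"},
    {"name": "Add2", "vj": 2.0, "vk": None, "qj": None, "qk": "Mult1"},
]
regs = {"F0": 0.0}
reg_status = {"F0": "Mult1"}

before = ready(stations)
broadcast("Mult1", 8.0, stations, regs, reg_status)
after = ready(stations)
```

Both stations were waiting on their second operand from Mult1, so one broadcast makes both ready at once. The register file is updated only if the register still names Mult1 as its producer.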




## Tomasulo's approach: advantages

- Hazard detection logic is distributed
  - If multiple instructions are waiting on the second of two operands, they can all be released simultaneously by a single broadcast on the CDB
- WAW and WAR hazards are eliminated because
  - Register renaming is performed using the reservation stations
  - Operands are stored in the reservation stations as soon as they are available
- The WAR hazard was eliminated because the reservation station held the value of F6 for the DIVD instruction
  - Even if LD F6, 40(R2) hadn't completed before the DIVD issued
    - The WAR hazard and the potential WAW hazard are still eliminated
    - $Q_k$ would point to the Load1 reservation station for the value of F6

## Tomasulo's approach: loop unrolling

- Loop unrolling is performed dynamically!
- With only 4 FP registers, WAW and WAR hazards would severely limit loop unrolling, even by the compiler
  - Virtual registers provided by the reservation stations make it possible to execute multiple iterations of some loops simultaneously
- Memory disambiguation
  - Since the store functional unit keeps a memory address as well as a value, it's possible to do disambiguation
  - When a memory operation is issued, check to see if that location is already involved in an operation
- $\Rightarrow$  LOADs and STOREs from different iterations of the loop can be executed *non-sequentially*
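The address check behind memory disambiguation can be sketched as follows. The rule used here (a load conflicts only with a pending store to the same address, a store conflicts with any pending access to it) is an assumption of this example, as are the name `may_proceed` and the addresses.

```python
def may_proceed(kind, address, pending):
    """Return True when a new load or store does not conflict with pending accesses."""
    for op in pending:
        if op["address"] != address:
            continue  # different location: no dependence through memory
        if kind == "store" or op["kind"] == "store":
            return False  # same location and at least one side writes it
    return True


# Pending memory operations from earlier loop iterations (hypothetical addresses).
pending = [
    {"kind": "store", "address": 0x100},
    {"kind": "load", "address": 0x200},
]

load_same_as_store = may_proceed("load", 0x100, pending)
load_same_as_load = may_proceed("load", 0x200, pending)
store_same_as_load = may_proceed("store", 0x200, pending)
load_elsewhere = may_proceed("load", 0x300, pending)
```

Accesses to distinct addresses, or two loads of the same address, can be reordered freely, which is what lets loads and stores from different iterations overlap.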

[Figure: Tomasulo example tables, including instruction status (Write result column), reservation stations (Load2, Add, Mult1), and register status (F10, F12 ... F30)]