Solution to HW4 of CMSC611
Problem 1
Dependency:
Actually the value computed by S3 will be replaced by the value computed by next iteration. So S3 is useless.
for (i = 1; i<100; i = i+1){ a[i] = b[i] + c[i];
b[i] = a[i] + d[i];
}
a[100] = a[99] + e[99];
Problem 2
MULTD F0, F0, F4LD F1, -8(R1)
LD F5, - 8(R2)
LD F3, -16(R1)
MULTD F1, F1, F5
ADDD F2, F0, F2
LD F6, -16(R2)
SUB R1, R1, 24
MULTD F3, F3, F6]
ADDD F2, F1, F2
BNEQZ R1, loop
SUB R2, R2, 24
ADDD F2, F2, F3
Speedup = 7/(13/3) = 1.62
Problem 3
State of scoreboard:
Instruction status
Instruction | Issue | Read | Execution | Write |
LD F8,0(R3) | Yes | Yes | Yes | Yes |
ADDD F6, F4, F2 | Yes | Yes | Yes | Yes |
SUBD F4, F8, F0 | Yes | Yes | Yes | Yes |
MULTD F10, F4, F0 | Yes | |||
DIVD F12, F8, F2 | Yes | Yes | ||
LD F14, 0(R4) | Yes | Yes | Yes | Yes |
MULTD F0, F6, F8 | Yes |
Function units status
BUSY | OP | Fi | Fj | Fk | Qj | Qk | Rj | Rk | |
Integer | NO | ||||||||
ADD1 | NO | ||||||||
ADD2 | NO | ||||||||
MUL1 | Yes | MUL | F10 | F4 | F0 | ADD2 | Yes | Yes | |
MUL2 | Yes | MUL | F0 | F6 | F8 | Yes | Yes | ||
DIV | Yes | DIV | F12 | F8 | F2 | No | No |
Register status
F0 | F2 | F4 | F6 | F8 | F10 | F12 | |
FU | MUL | MUL | DIV |
State of Tomasulo's algorithm:
Instruction status
Instruction | Issue | Execute | Write Result |
LD F8, 0(R3) | Yes | Yes | Yes |
ADDD F6, F4, F2 | Yes | Yes | Yes |
SUBD F4, F8, F0 | Yes | Yes | Yes |
MULTD F10, F4, F0 | Yes | ||
DIVD F12, F8, F2 | Yes | Yes | Yes |
LD F14, 0(R4) | Yes | Yes | |
MULTD F0, F6, F8 | Yes |
Reservation stations
Name | Busy | Op | Vj | Vk | Oj | Qk |
Integer | No | |||||
ADD1 | No | |||||
ADD 2 | No | |||||
MUL1 | Yes | Mul | Regs[F0] | ADD2 | ||
MUL2 | No | |||||
DIV | Yes | DIV | Regs[F8] | Regs[F2] |
Register status
Field | F0 | F2 | F4 | F6 | F8 | F10 | F12 |
MUL | DIV |
Problem 4
a. For loop2, the percentage of the branch instruction taken is: 4/5 = 80%
the percentage of the branch instruction not taken is: 1/5 = 20%b. For both loop1 and loop2, the best static branch prediction scheme is predict taken. Percentage of correct prediction is 80%.
c. Predictor was initialized to taken "T":
R4 = | Loop2
Prediction |
Loop2
Action |
New Loop2
Prediction |
4
3 2 1 0 |
T
T T T T |
T
T T T NT -x--> |
T
T T T NT |
4
3 2 1 0 |
NT
T T T T |
T
T T T NT -x--> |
T
T T T NT |
One-bit predictor for Loop1:
R8= | Loop1
Prediction |
Loop1
Action |
New Loop1
prediction |
7 | T | T | T |
7 | T | T | T |
When outer loop is executed n times, the inner loop will executed 5n times. Every iteration, the predictor will predict wrongly 2 times except for the first time. At the first time the inner loop execute, it will predict wrong for 1 time. All the following, it will predict wrong 2 times out of 5 predictions. So the misprediction rate is (2n-1)/5n.The misprediction rate is about 40%.
d. Predictor was initialized to taken "T"(11):
The bits represent:( 00-NT,01-NT,10-T,11-T)
Two-bit predictor for Loop2:
R4= | Loop2
Prediction |
Loop2
Action |
New Loop2
Prediction |
|
4
3 2 1 0 |
T(11)
T(11) T(11) T(11) T(11) |
T
T T T NT |
T(11)
T(11) T(11) T(11) T(10) |
|
4
3 2 1 0 |
T(10)
T(11) T(11) T(11) T(11) |
T
T T T NT |
T(11)
T(11) T(11) T(11) T(10) |
|
Two-bit predictor for Loop1:
R8= | Loop1
Prediction |
Loop1
Action |
New Loop1
Prediction |
7
7 |
T(11)
T(11) |
T
T |
T(11)
T(11) |
If we initialize the predictor to be "11", the misprediction rate is 0.
Problem 5
Branch penalty a:
80% * (1-90%) * 5 + 20% * 3 = 0.4 + 0.6 = 1
Branch penalty b:
60% * 2 + (1-60%) * 0 = 1.2
speedup = (1.2+1.2*15%)/(1.2+1 * 15%) = 1..022
Problem 6
1.Issue:
If there is an empty entry both in reservation buffer and in history buffer, we issue the instruction, otherwise stall it. Sending operands to the reservation station are available in the register file.2.Execute:
If one or two of the operands is not yet available, monitoring CDB while waiting for register to be computed. This step checks the RAW hazard.3.Write result:
when result available, write in on the CDB, from there into the reservation station for this result, and write back to register file, before doing it, make a original of this register in the "history buffer" and fill in history buffer fields. If it is a store instruction, put it in the store buffer, don't write the result memory for now.
Reservation stations
Name | Busy | Op | Vj | Vk | Qj | Qk | Dest |
Nult1 | No | Nult | Mem[Reg[R1]] | [Reg[F2] | #2 | ||
Nult2 | No | Nult | Mem[Reg[R1]] | Reg[F2] | #7 |
History buffer
Entry | Busy | Instruction | Register | Value | Speculative |
1
2 3 4 5 6 7 8 9 10 |
No
No Yes Yes Yes Yes Yes Yes Yes Yes |
LD F0, 0(R1)
MULT F4, F0, F2 SD 0(R1), F4 SUBI F1, F1, 8 BNEZ R1, loop LD F0, 0(R1) MULT F4, F0,F2 SD 0(R1), F4 SUBI R1, R1, 8 BNEZ R1, loop |
F0
F4 R1 F0 F4 R1 |
Mem[Reg[R1]]
Reg[F0]xReg[F4] #2 R1-8 Mem[Reg[R1]] Reg[F0]xReg[F4] #7 R1-8 |
No
No No No Yes Yes Yes Yes Yes Yes |
Register status (FP)
Field | F0 | F2 | F4 | F6 | F8 | F10 | F12 | ... | F30 |
History | #6 | #7 |
The changes we need to make to figure 4.37:
Instruction status | Wait until | Action or bookkeeping |
Issue | Station and history buffer both available | Change all the Recorder[] to History [] |
Execute | Same | |
Write Result | After filling in the entry for history buffer | Results go directly to destination reg. Keep original copy of reg in history buffer |
Problem 7
SLL R2, RN, 3ADD R2, R2, Ra
SLL R1, RN, 26
SRL R1, R2, 26
ADDI R3, R0, 64
MOVI2S VLR, R3
BEQZ R1, LOOP
MOVI2S VLR,R!
SLL R1, R1, 3
LOOP:LV V1, Ra
DDVS V1, V1, FXThere are 2 memory pipelines, for the 6 vector operations, we can arrange them in 2 convoys.LV V2, Rb
MULTV V1, V1, V2
MULTVS V3, V1, Fy
SV Rc, V
ADD Ra, Ra, R1
ADD Rb, Rb, R1
ADD Rc, Rc, R1
ADDI R1, R0, 512
MOVI2S VLR, R3
SUB R4, R2, Ra
BNZ R4, LOOP
Tchime = 2 We assume Tloop = 15 Rbase = 0
Tstart = 15 + 6 + 15 + 8 + 8 + 15 = 67
T200 = Tbase + ceiling(n/MVL) * (Tloop + Tstart) + n * Tchime
= 0 + ceiling(200/64) * (15 + 67) + 200 * 2 = 728
R200 = 3 * 200 * 400 / 728 = 329.67
Cycles/element = lim 2n + n * (15 + 67) / 64 = +82/64 = 3.28
n->infinite
Rinfinite = 3 * 400 / 3.28 = 365.85
For N1/2: MFLOPS for N1/2 = 365.85/2 = 182.93 = 3 * N1/2/T1/2 =>T1/2 = 6.56 * N1/2
Assuming N1/2 <= 64, so Tn<=64 = 1 * 82 + 2n
-->82 + 2 * N1/2 = 6.56 * N1/2
-->N1/2 = 17.98 = 18