Solution to HW4 of CMSC611

Solution to HW4 of CMSC611

Problem 1

Dependency:

S2 uses a value computed by S1 in the same iteration;
S3 uses a value computed by S1 in the same iteration;

Actually the value computed by S3 will be replaced by the value computed by next iteration. So S3 is useless.

for (i = 1; i<100; i = i+1)
{ a[i] = b[i] + c[i];
b[i] = a[i] + d[i];
}
a[100] = a[99] + e[99];

Problem 2

MULTD F0, F0, F4
LD F1, -8(R1)
LD F5, - 8(R2)
LD F3, -16(R1)
MULTD F1, F1, F5
ADDD F2, F0, F2
LD F6, -16(R2)
SUB R1, R1, 24
MULTD F3, F3, F6]
ADDD F2, F1, F2
BNEQZ R1, loop
SUB R2, R2, 24
ADDD F2, F2, F3
Speedup = 7/(13/3) = 1.62

Problem 3

State of scoreboard:

Instruction status

Instruction	Issue	Read	Execution	Write
LD F8,0(R3)	Yes	Yes	Yes	Yes
ADDD F6, F4, F2	Yes	Yes	Yes	Yes
SUBD F4, F8, F0	Yes	Yes	Yes	Yes
MULTD F10, F4, F0	Yes
DIVD F12, F8, F2	Yes	Yes
LD F14, 0(R4)	Yes	Yes	Yes	Yes
MULTD F0, F6, F8	Yes

Function units status

	BUSY	OP	Fi	Fj	Fk	Qj	Qk	Rj	Rk
Integer	NO
ADD1	NO
ADD2	NO
MUL1	Yes	MUL	F10	F4	F0	ADD2		Yes	Yes
MUL2	Yes	MUL	F0	F6	F8			Yes	Yes
DIV	Yes	DIV	F12	F8	F2			No	No

Register status

	F0	F2	F4	F6	F8	F10	F12
FU	MUL					MUL	DIV

State of Tomasulo's algorithm:

Instruction status

Instruction	Issue	Execute	Write Result
LD F8, 0(R3)	Yes	Yes	Yes
ADDD F6, F4, F2	Yes	Yes	Yes
SUBD F4, F8, F0	Yes	Yes	Yes
MULTD F10, F4, F0	Yes
DIVD F12, F8, F2	Yes	Yes	Yes
LD F14, 0(R4)	Yes	Yes
MULTD F0, F6, F8	Yes

Reservation stations

Name	Busy	Op	Vj	Vk	Oj	Qk
Integer	No
ADD1	No
ADD 2	No
MUL1	Yes	Mul		Regs[F0]	ADD2
MUL2	No
DIV	Yes	DIV	Regs[F8]	Regs[F2]

Register status

Field	F0	F2	F4	F6	F8	F10	F12
						MUL	DIV

Problem 4

a. For loop2, the percentage of the branch instruction taken is: 4/5 = 80%

the percentage of the branch instruction not taken is: 1/5 = 20%

b. For both loop1 and loop2, the best static branch prediction scheme is predict taken. Percentage of correct prediction is 80%.

c. Predictor was initialized to taken "T":

R4 = Loop2
Prediction
Loop2
Action
New Loop2
Prediction

4
3
2
1
0
T
T
T
T
T
T
T
T
T
NT -x-->
T
T
T
T
NT

4
3
2
1
0
NT
T
T
T
T
T
T
T
T
NT -x-->
T
T
T
T
NT

One-bit predictor for Loop1:

R8=	Loop1 Prediction	Loop1 Action	New Loop1 prediction
7	T	T	T
7	T	T	T

When outer loop is executed n times, the inner loop will executed 5n times. Every iteration, the predictor will predict wrongly 2 times except for the first time. At the first time the inner loop execute, it will predict wrong for 1 time. All the following, it will predict wrong 2 times out of 5 predictions. So the misprediction rate is (2n-1)/5n.The misprediction rate is about 40%.

d. Predictor was initialized to taken "T"(11):

The bits represent:( 00-NT,01-NT,10-T,11-T)

Two-bit predictor for Loop2:

R4=	Loop2 Prediction	Loop2 Action	New Loop2 Prediction
4 3 2 1 0	T(11) T(11) T(11) T(11) T(11)	T T T T NT	T(11) T(11) T(11) T(11) T(10)
4 3 2 1 0	T(10) T(11) T(11) T(11) T(11)	T T T T NT	T(11) T(11) T(11) T(11) T(10)

Two-bit predictor for Loop1:

R8=	Loop1 Prediction	Loop1 Action	New Loop1 Prediction
7 7	T(11) T(11)	T T	T(11) T(11)

If we initialize the predictor to be "11", the misprediction rate is 0.

Problem 5

Branch penalty a:

80% * (1-90%) * 5 + 20% * 3 = 0.4 + 0.6 = 1

Branch penalty b:

60% * 2 + (1-60%) * 0 = 1.2

speedup = (1.2+1.2*15%)/(1.2+1 * 15%) = 1..022

Problem 6

1.Issue:

If there is an empty entry both in reservation buffer and in history buffer, we issue the instruction, otherwise stall it. Sending operands to the reservation station are available in the register file.

2.Execute:

If one or two of the operands is not yet available, monitoring CDB while waiting for register to be computed. This step checks the RAW hazard.

3.Write result:

when result available, write in on the CDB, from there into the reservation station for this result, and write back to register file, before doing it, make a original of this register in the "history buffer" and fill in history buffer fields. If it is a store instruction, put it in the store buffer, don't write the result memory for now.

Reservation stations

Name	Busy	Op	Vj	Vk	Qj	Qk	Dest
Nult1	No	Nult	Mem[Reg[R1]]	[Reg[F2]			#2
Nult2	No	Nult	Mem[Reg[R1]]	Reg[F2]			#7

History buffer

Entry

Busy

Instruction

Value

Speculative

Yes

LD F0, 0(R1)

MULT F4, F0, F2

SD 0(R1), F4

SUBI F1, F1, 8

BNEZ R1, loop

LD F0, 0(R1)

MULT F4, F0,F2

SD 0(R1), F4

SUBI R1, R1, 8

BNEZ R1, loop

Mem[Reg[R1]]

Reg[F0]xReg[F4]

R1-8

Mem[Reg[R1]]

Reg[F0]xReg[F4]

R1-8

Yes

Register status (FP)

Field	F0	F2	F4	F6	F8	F10	F12	...	F30
History	#6		#7

The changes we need to make to figure 4.37:

Instruction status Wait until Action or bookkeeping

Issue Station and history buffer both available Change all the Recorder[] to History []

Execute Same

Write Result After filling in the entry for history buffer Results go directly to destination reg. Keep original copy of reg in history buffer

Problem 7

SLL R2, RN, 3
ADD R2, R2, Ra
SLL R1, RN, 26
SRL R1, R2, 26
ADDI R3, R0, 64
MOVI2S VLR, R3
BEQZ R1, LOOP
MOVI2S VLR,R!
SLL R1, R1, 3
LOOP:LV V1, Ra

DDVS V1, V1, FX
LV V2, Rb
MULTV V1, V1, V2
MULTVS V3, V1, Fy
SV Rc, V
ADD Ra, Ra, R1
ADD Rb, Rb, R1
ADD Rc, Rc, R1
ADDI R1, R0, 512
MOVI2S VLR, R3
SUB R4, R2, Ra
BNZ R4, LOOP

There are 2 memory pipelines, for the 6 vector operations, we can arrange them in 2 convoys.

Tchime = 2 We assume Tloop = 15 Rbase = 0

Tstart = 15 + 6 + 15 + 8 + 8 + 15 = 67

T200 = Tbase + ceiling(n/MVL) * (Tloop + Tstart) + n * Tchime

= 0 + ceiling(200/64) * (15 + 67) + 200 * 2 = 728

R200 = 3 * 200 * 400 / 728 = 329.67

Cycles/element = lim 2n + n * (15 + 67) / 64 = +82/64 = 3.28

n->infinite

Rinfinite = 3 * 400 / 3.28 = 365.85

For N1/2: MFLOPS for N1/2 = 365.85/2 = 182.93 = 3 * N1/2/T1/2 =>T1/2 = 6.56 * N1/2

Assuming N1/2 <= 64, so Tn<=64 = 1 * 82 + 2n

-->82 + 2 * N1/2 = 6.56 * N1/2

-->N1/2 = 17.98 = 18

Instruction status	Wait until	Action or bookkeeping
Issue	Station and history buffer both available	Change all the Recorder[] to History []
Execute	Same
Write Result	After filling in the entry for history buffer	Results go directly to destination reg. Keep original copy of reg in history buffer