1. Assume that the number of instructions is 100 .
DATA REFERENCE : percentage = 35%, total = 35 instructions
BRANCH
: percentage = 19%, total = 19 instructions
ALU operations :
percentage = 46%, total = 46 instructions
a. Allowed offset
lenghts are 0-, 8, and 16-bits including sign bit.
for data reference ,
17% * 35 = 5.95 -----------> offset
of 0
40% * 35 = 14.00 -----------> offset of 8
43% * 35 = 15.05 -----------> offset of 16
for branch instructions,
no instructions have an offset of 0.
93% * 19 = 17.67 -----------> offset of 8
7% * 19 = 1.33 -----------> offset of 16
for ALU instructions,
all the 46 instructions have an offset of 0. they are all 16 bit instructions.
# bits = (46+ 5.95) *16 + (14 + 17.67) * 24 + (15.05 + 1.33) * 32 = 2112.8
#bytes = 2112.8 / 8 = 264.1
#bits / instruction = 2112.8 / 100 = 21.13
b. since all the instructions
have a fixed length of 24 bits, the instructions with 0- and 8-bit offset
will not have any change, but
the 32
bit instructions ( having an offset of 16 bits) will take two 24-bit
instructions.
# bits = (46+ 5.95) *24 + (14 + 17.67) * 24 + (15.05 + 1.33) * 2* 24 =
2790
# bytes = 2790 / 8 = 348.75
a comparison of (a) with (b) is ---------- 348.75
/ 264.1 = 1.32
c. the instructions
are assumed to have a fixed offset of 16 bits. thus all the instructions
which have an offset of 8-bits will now
use 16-bits for the same.
#bits = (46+ 5.95) *16 + (14 + 17.67) * 32 + (15.05 + 1.33) * 32 = 2368.8
#bytes = 2368.8 / 8 = 296.1
a comparison of (c) with (b) gives --------- (c) / (b) = 296.1 / 348.75
= 0.849
2. let the total instruction counts are IC
CPUtime= IC * CPI * Clock cycle
Instruction counts for ALU are:
ALU = 23.8% + 8.3% +1.3% + 7.0% + 9.4% + 4.8% +
2.0% = 56.6%
Instruction counts for Jumps are:
Jumps = 0.5% + 0.4% + 0.5% = 1.4 %
Conditional branches (taken) counts = 15% * 60% =
9%;
conditional branches (not taken) counts = 15% *
40% = 6%;
so, CPUtime(old) = IC* (56.6% * 0.5 + 1.4% * 1.6
+ 20.9% *1.7 + 5.1%*1.0
+ 9% *2.2 + 6%*0.6 )* clock cycle
=1.1155IC/600
Assume x load & store instruction must use the
updating addressing mode for the CPU speed
to remain the same, becuase each updating addressing
mode will decrease add or subtract by
one, so, if we run the same amount of instruction
after updating, the two machines should
consume the same amount of time. we have known the
load & instruction take 26% total counts,
so,
CPIorig/600 =(CPIorig - 0.8* 26%*x)/550
1.1155/600 = (1.1155 -0.8 * 26%*x)/550
x= 44.69%
3. A=X[B]+Y[B]
E=Y[B]+Z[B]
Implementation for acumulator:
--comments---
load B
Mul d
store tmp ;----- tmp = [B]
add X
store oper1 ------oper1=X[B]
load Y
add tmp
store oper2 -------- oper2=Y[B]
load (oper1)
add (oper2)
store A ----------A=X[B]+Y[B]
load Z
add tmp
store oper3
load (oper2) --------oper2=Y[B]
add (oper3) -----oper3=Z[B]
store E ------E=Y[B]+Z[B]
Instructions bytes = 16*3+5=53
Memory data bytes=16*4=64
-----------
Implementation of stack
push B
push d
mul
pop tmp
push tmp
push x
add
pop addrx
push tmp
push Y
add
pop addry
push (addry)
push (addrx)
add
pop A
push tmp
push Z
add
pop addrZ
push (addrZ)
push (addrY)
add
pop E
Instruction bytes fetched =3*17+6+5=62 bytes
Memory data bytes = 4*17=68 bytes
----------------------
Implementataion of load-store
load R1,B
load R2,d
mul R3,R1,R2 ;R3=dis
load R4,x
add R5,R3,R4 ;R5=addrX
load R6, Y
add R7,R3,R6 ;R7=AddrY
load R8,(R5)
load R9,(R7)
add R10,R8,R9 ; =A
load R11,Z
add R12,R11,R3
load R13,(R12)
add R14,R13,R9
store A,R10
store E,R14
Instruction bytes fetched = 3*2+6*4+1*6+5*3=51 bytes
Memory data bytes = 6*4+1*0+3*4+5*0=36 bytes
-----------------------
Implementation of memory to memory
mul dis,B,d
Add addrx, x, dis
add addry,y,dis
add A,(addrx),(addry)
add addrz,Z,dis
add E,(addry),(addrz)
Instruction bytes = 5*7+9=44 bytes
Memory data bytes transfered = 17*4=68 bytes
Since mem-meo architecture has
the smallest number on instruction bytes fetched,
this is most efficient measured
by code size. The load and store has the
smallest of the combined number.
so this is the most efficient architecture in
total memory bandwidth.
4. a. Effective CPI for compress:
ALU instruction: and, sub, compare, load imm, shift, and, or, orther
14.4% + 1.8% + 15.4% + 8.1% + 6.5% + 2.1% + 6.0% + 1.0% = 55.3%
CPI = 0.8*55.3% + 1.7*19.8% + 1.0*5.6% + 2.2*60%*17.4% + 0.6*40%*17.4%
+ 0.6*40%*17.4% + 1.6*(1.5%+0.1+0.1)
= 1.134
b. Effective CPI for hydro2d:
ALU instruction:
10.9% + 0.2% + 1.2% + 0.2% + 2.4% + 0.1% + 5.2% = 20.2%
CPI= 0.8*20.2% + 1.7*(0.1%+24.1%) + 2.2*60%*11.7% + 0.6*40%*11.7%
+ 2.1*(3.6%+7.9%+10.4%) + 2.5*9.4% + 4.8*1.6% + 5.5*0.2% + 1.0*9.9%
= 1.637
c. When FP operations are
replaced by integer instructions, for hydro2d, we can see
from Figure that CPI of integer(0.8) ALU is much less that that of FP(2.1,2.5,4.8).
Then we can expect that average CPI will be bigger after replacement.
5. From the assembly
code, we get distribution of instructions as
follows, where N is number of times that Dhrystone
benchmark runs.
Load
84*N + 174
Store
67*N + 36
Add
70*N + 8
Sub
14*N + 2
Mul
1*N
Div
1*N
Compare
11*N
Load imm
27*N + 20
Cond branch
21*N + 3
Jump
27*N + 69
Call
Return, jmp ind
Shift
15*N + 1
And
6*N
Or
1*N
Others
24*N + 45
When N = 1, we get
Load
84*1 + 174 35.49%
Store
67*1 + 36 14.17%
Add
70*1 + 8 10.73%
Sub
14*1 + 2 2.2%
Mul
1*1
0.138%
Div
1*1
0.138%
Compare
11*1
1.51%
Load imm
27*1 + 20 6.46%
Cond branch
21*1 + 3 3.3%
Jump
27*1 + 69 13.20%
Call
Return, jmp ind
Shift
15*1 + 1 2.2%
And
6*1
0.825%
Or
1*1
0.138%
Others
24*1 + 45 9.49%
When N = 100, we get
Load
84*100 + 174 23.01%
Store
67*100 + 36 18.08%
Add
70*100 + 8 18.81%
Sub
14*100 + 2 3.76%
Mul
1*100 0.268%
Div
1*100 0.268%
Compare
11*100 2.95%
Load imm
27*100 + 20 7.3%
Cond branch
21*100 + 3 5.64%
Jump
27*100 + 69 7.43%
Call
Return, jmp ind
Shift
15*100 + 1 4.03%
And
6*100 1.61%
Or
1*100 0.268%
Others
24*100 + 45 6.56%
When N = 1000, we get
Load
84*1000 + 174 22.79%
Store
67*1000 + 36 18.15%
Add
70*1000 + 8 18.95%
Sub
14*1000 + 2 3.79%
Mul
1*1000 0.271%
Div
1*1000 0.271%
Compare
11*1000 2.98%
Load imm
27*1000 + 20 7.32%
Cond branch
21*1000 + 3 5.69%
Jump
27*1000 + 69 7.33%
Call
Return, jmp ind
Shift
15*1000 + 1 4.06%
And
6*1000 1.62%
Or
1*1000 0.271%
Others
24*1000 + 45 6.51%
When N = 10000, we get
Load
84*10000 + 174 22.77%
Store
67*10000 + 36 18.16%
Add
70*10000 + 8 18.97%
Sub
14*10000 + 2 3.8%
Mul
1*10000 0.271%
Div
1*10000 0.271%
Compare
11*10000 2.98%
Load imm
27*10000 + 20 7.32%
Cond branch
21*10000 + 3 5.69%
Jump
27*10000 + 69 7.32%
Call
Return, jmp ind
Shift
15*10000 + 1 4.06%
And
6*10000 1.63%
Or
1*10000 0.271%
Others
24*10000 + 45 6.50%
From above, we know when
N is big enough, the percentage of each
instruction will almost be constant. When
we compare this to the
instruction frequencies in Figure 2.26, we
find that there is not a
particular program matching Dhrystone benchmark
very well, though in
some sense gcc matches this benchmark. The
reason is that the most
significant parts in Dhrystone benchmark (as
well as in gcc) are
Load, Add and Store instructions. Probably
Dhrystone benchmark is an
integer computation program.