1. Problem 5.1 from Textbook
From the question we know that Cache A has 128 entries and is 2-way set associative, while Cache B has 256 entries and is direct-mapped.
a) The following program makes Machine A run much faster than Machine B:
b1: beqz r0, b2    ; branch if r0 = 0
b2: beqz r0, b1    ; branch if r0 = 0
Here we make two assumptions:
1) the value in r0 is 0;
2) the locations of b1 and b2 map to the same set in both caches.
The reason: on Machine A, the code causes cache misses only the first time the two instructions are loaded into the cache; after that, since the cache is 2-way set associative, both instructions fit in the same set and every access hits. On Machine B, every access misses, because the direct-mapped cache can hold only one block per set, so b1 and b2 keep evicting each other. The small simulation sketch below illustrates this.
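As a rough check (not part of the textbook answer), here is a small Python sketch of a single cache set with LRU replacement; the function name, the trace, and the 1000-iteration count are my own choices:

    def count_misses(ways, trace):
        cache = []                       # one set; most recently used at the end
        misses = 0
        for addr in trace:
            if addr in cache:
                cache.remove(addr)       # hit: refresh LRU position
            else:
                misses += 1
                if len(cache) == ways:
                    cache.pop(0)         # evict the least recently used block
            cache.append(addr)
        return misses

    trace = ["b1", "b2"] * 1000          # alternating fetches of b1 and b2
    print(count_misses(2, trace))        # Machine A (2-way set): 2 cold misses only
    print(count_misses(1, trace))        # Machine B (one block per set): 2000 misses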
b) The following program makes Machine B run much faster than Machine A:
b1: sw 0(r1), r0    ; store r0 to memory
b2: beqz r0, b1     ; branch if r0 = 0
Here we make two assumptions:
1) the value in r0 is 0;
2) the locations of b1, b2, and 0(r1) map to different sets within the caches and are all initially resident.
The reason: on Machine A, each time the sw executes, the stored data is written through to memory, which takes additional time. On Machine B, the sw only updates the cache block, since the cache is write-back; the block is not written to main memory until it is replaced.
c) For Machine A, if we ignore overhead, every instruction access hits the cache, so the total memory-access time is n * ha, where n is the number of instructions executed and ha is the cache hit time of Machine A.
For Machine B, every instruction access causes a cache miss, so the total memory-access time is n * 10hb, where 10hb is the miss time on cache B. If we assume ha = hb, then in this situation Machine A runs the code 10 times faster than Machine B.
d) Machine B needs a read access (which hits the cache) to fetch each instruction and a write access (which also hits the cache) to write the stored data into the cache. This leads to a total time of 1.5 * n * hb. Machine A also needs a read access (which hits the cache) to fetch each instruction, but it must write the stored data all the way to main memory because of the write-through policy, which takes five times the hit time. Thus the total time is n * (ha + 0.5 * 5ha) = 3.5 * n * ha. If we assume ha = hb, then Machine B runs 3.5 / 1.5 = 2.33 times faster than Machine A.
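A quick numerical check of c) and d) in Python, under the same assumption ha = hb used above:

    ha = hb = 1.0                        # assume equal hit times, as in the text

    # c) Machine A: every fetch hits; Machine B: every fetch misses at 10x the hit time
    print((10 * hb) / ha)                # 10.0 -> Machine A is 10x faster

    # d) per instruction: one fetch plus 0.5 stores
    time_b = 1.5 * hb                    # Machine B: fetch hit + store hits the write-back cache
    time_a = ha + 0.5 * 5 * ha           # Machine A: fetch hit + store written through (5x hit time)
    print(time_a / time_b)               # 2.33... -> Machine B is about 2.33x faster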
a) We have: Global Miss Rate of L2 Cache = Miss Rate_L1 * Miss Rate_L2
Global Miss Rate of L2 Cache = 1 - 99.8% = 0.2%
Miss Rate_L1 = [1 * 4% + 10% * (15% + 8%)] / [1 + 15% + 8%] = 5.122%
Thus Miss Rate_L2 = 0.2% / 5.122% = 3.905%
Thus the local hit rate of the L2 cache is 1 - 3.905% = 96.095%.
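The same arithmetic in a few lines of Python, as a check (the variable names are mine):

    global_miss_l2 = 1 - 0.998                            # 0.2% global L2 miss rate
    miss_rate_l1 = (1 * 0.04 + 0.10 * (0.15 + 0.08)) / (1 + 0.15 + 0.08)
    miss_rate_l2 = global_miss_l2 / miss_rate_l1          # local L2 miss rate
    print(miss_rate_l1)                                   # ~0.05122
    print(miss_rate_l2, 1 - miss_rate_l2)                 # ~0.03905, ~0.96095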
b) CPI exec = 0.8
Avg. memory access time per instruction = [Hit time_L1 + Miss Rate_L1 * (Hit time_L2 + Miss Rate_L2 * Miss Penalty_L2)] * 1.23
We have:
Hit time_L1 = 0
Miss Rate_L1 = 5.122%
Hit time_L2 = 10 ns
Miss Rate_L2 = 3.905%
Also, we know that on an L2 cache miss, if the block being replaced is not dirty, the penalty is 1 (address) + 4 (8/2 data transfers) + 5 (memory latency) = 10 bus cycles, thus 100 ns. If the replaced block is dirty, the penalty doubles to 200 ns. Thus Miss Penalty_L2 = 100 * 50% + 200 * 50% = 150 ns.
So, Avg. memory access time per instruction = [0 + 5.122% * (10 + 3.905% * 150) ns] * 1.23 = 0.999 ns, thus CPI mem = 0.999 ns / (2 ns per cycle) = 0.4995 CPU cycles.
Thus the total CPI = CPI exec + CPI mem = 0.8 + 0.4995 = 1.2995 cycles.
Thus the utilization of the main memory is 0.4995 / 1.2995 = 38.44%.
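A short Python check of the part b) numbers; the division by 2 above implies a 2 ns CPU clock cycle, which is assumed here:

    miss_l1, miss_l2 = 0.05122, 0.03905
    miss_penalty_l2 = 0.5 * 100 + 0.5 * 200               # 150 ns (clean/dirty 50/50)
    amat = (0 + miss_l1 * (10 + miss_l2 * miss_penalty_l2)) * 1.23
    cpi_mem = amat / 2.0                                  # assumed 2 ns per CPU cycle
    cpi = 0.8 + cpi_mem
    print(amat, cpi_mem)                                  # ~0.999 ns, ~0.4995 cycles
    print(cpi, cpi_mem / cpi)                             # ~1.2995, ~0.3844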
c) If the main memory bus were doubled to 128 bits (4 words) wide, we need to re-compute Miss Penalty_L2. Now, on an L2 cache miss, if the block being replaced is not dirty, the penalty is 1 (address) + 2 (8/4 data transfers) + 5 (memory latency) = 8 bus cycles, thus 80 ns. If the replaced block is dirty, the penalty doubles to 160 ns. Thus Miss Penalty_L2 = 80 * 50% + 160 * 50% = 120 ns.
So, Avg. memory access time per instruction = [0 + 5.122% * (10 + 3.905% * 120) ns] * 1.23 = 0.925 ns, thus CPI mem = 0.925 / 2 = 0.463 CPU cycles.
Thus the total CPI = CPI exec + CPI mem = 0.8 + 0.463 = 1.263 cycles.
Thus the utilization of the main memory is 0.463 / 1.263 = 36.67%.
Also, in this case, the speedup over part b) is 1.2995 / 1.263 = 1.028.
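The same check with the 128-bit bus:

    miss_penalty_l2 = 0.5 * 80 + 0.5 * 160                # 120 ns with the wider bus
    amat = 0.05122 * (10 + 0.03905 * miss_penalty_l2) * 1.23
    cpi_mem = amat / 2.0                                  # assumed 2 ns per CPU cycle
    cpi = 0.8 + cpi_mem
    print(amat, cpi_mem, cpi)                             # ~0.925 ns, ~0.463, ~1.263
    print(cpi_mem / cpi, 1.2995 / cpi)                    # ~0.366, ~1.029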
To estimate the performance, we have the following formula:
CPU time = IC * Clock Cycle Time * CPI. Since IC and the clock cycle time are the same here, we can use CPI alone as the performance factor.
Also, CPI = CPI exec (the CPI without memory stalls) + CPI mem (the CPI due to memory stall cycles).
Suppose f-load, f-store, and f-other are the frequencies of load, store, and other instructions; then we have:
CPI exec = f-load * CPI_load + f-store * CPI_store + f-other * CPI_other
CPI mem = Memory Stall Cycles / Instruction
        = Instruction Fetch Stall Cycles / Instruction + Data Access Stall Cycles / Instruction
        = (Instruction Fetches / Instruction) * Instruction Cache Miss Rate * I_MissPenalty + (Data Accesses / Instruction) * Data Cache Miss Rate * D_MissPenalty
        = 1 * 0.5% * I_MissPenalty + (f-load + f-store) * 1% * D_MissPenalty
We know that I_MissPenalty = 50 cycles, since the instruction cache only has read misses and never has write-back traffic.
The D_MissPenalty depends on the cache's write policy: for write-through, D_MissPenalty = 50 cycles, but for write-back, D_MissPenalty = 50 + (50% * 50) = 75 cycles.
Thus, for write-through: CPI mem = 1 * 0.5% * 50 + (f-load + f-store) * 1% * 50 = 0.25 + 0.5 * (f-load + f-store)
For write-back: CPI mem = 0.25 + (f-load + f-store) * 1% * 75 = 0.25 + 0.75 * (f-load + f-store)
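The two formulas as small Python helpers (the names are mine), which can be used to fill in the tables below:

    def cpi_mem_write_through(f_load, f_store):
        # 1 fetch/instr * 0.5% I-miss * 50 cycles + data accesses * 1% D-miss * 50 cycles
        return 1 * 0.005 * 50 + (f_load + f_store) * 0.01 * 50

    def cpi_mem_write_back(f_load, f_store):
        # same, but the data miss penalty is 75 cycles (50 + 50% dirty * 50)
        return 1 * 0.005 * 50 + (f_load + f_store) * 0.01 * 75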
a) In this situation, CPI_load = 1, CPI_store = 2, and CPI_other = 1, thus CPI exec = f-load + 2 * f-store + f-other.
Now, using the statistics in Figure 2.26 on page 105 and the formulas above, the following table summarizes the results (CPI) for both the write-through and write-back caches.
| CPI mem (write-through) | Total CPI (write-through) | CPI mem (write-back) | Total CPI (write-back) | CPI exec |
| 0.425 | 1.515 | 0.513 | 1.603 | 1.090 |
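The row above can be reproduced as follows; the 26% load / 9% store mix is assumed here, matching the Figure 2.26 values cited later in this write-up:

    f_load, f_store, f_other = 0.26, 0.09, 0.65           # assumed instruction mix
    cpi_exec = 1 * f_load + 2 * f_store + 1 * f_other     # 1.090
    cpi_mem_wt = 0.25 + 0.5 * (f_load + f_store)          # 0.425
    cpi_mem_wb = 0.25 + 0.75 * (f_load + f_store)         # 0.5125 -> 0.513
    print(cpi_mem_wt, cpi_exec + cpi_mem_wt)              # 0.425 1.515
    print(cpi_mem_wb, cpi_exec + cpi_mem_wb)              # 0.5125 1.6025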
b) In this case, the formula for CPI exec is CPI exec = f-load + f-store + f-other = 1.000; the rest is the same.
| CPI mem (write-through) | Total CPI (write-through) | CPI mem (write-back) | Total CPI (write-back) | CPI exec |
| 0.425 | 1.425 | 0.513 | 1.513 | 1.000 |
From a) and b) we see that the write-through cache is always faster than the write-back cache in the case of this problem.
In this problem, we can compute the penalty for one TLB miss: 0.8 * 12 + 4 * 50 = 209.6 ns, since one TLB miss leads to executing 12 instructions (at the base CPI of 0.8 and a 1 GHz clock, i.e. 1 ns per cycle) with 4 data cache misses (penalty 50 ns per miss).
The penalty imposed by one regular cache miss is 50 ns
So, the TLB miss penalty per instruction = 209.6 ns * 0.01% = 0.02096 ns.
Meanwhile, the regular cache miss penalty per instruction = 1% * 50 + (26% + 9%) * 4% * 50 = 1.2 ns, if we assume the load instruction frequency is 26% and the store frequency is 9%, as in Figure 2.26.
Thus, although the penalty for one TLB miss is huge compared with one regular cache miss, the TLB miss rate is so low that its influence on performance is trivial compared with that of regular cache misses.
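The per-instruction comparison above, reproduced in Python (times in ns, 1 GHz clock as stated):

    tlb_miss_cost = 12 * 0.8 * 1.0 + 4 * 50                   # 9.6 + 200 = 209.6 ns per TLB miss
    tlb_per_instr = tlb_miss_cost * 0.0001                    # 0.01% TLB miss rate -> ~0.021 ns
    cache_per_instr = 0.01 * 50 + (0.26 + 0.09) * 0.04 * 50   # 0.5 + 0.7 = 1.2 ns
    print(tlb_per_instr, cache_per_instr)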
One scenario: suppose the cache has 4 blocks in total, instruction i is in block 1 and is the least recently used block, and the other three blocks are also occupied. In the fully associative cache, when another instruction j arrives it must evict i; if i is fetched again right after j, that fetch misses because i was just replaced. In the direct-mapped cache, if j happens to map to block 4, then block 4 is replaced instead, and the following fetch of i does not miss.
Another scenario in which the fully associative cache experiences more cache misses than the direct-mapped cache is the following. Suppose we have this code:
LOOP: Instr 1
      Instr 2
      Instr 3
      ...
      Instr n    ; jump back to LOOP
Also suppose the cache has n-1 blocks in total. In the direct-mapped cache, Instr 1 and Instr n map to block 1, Instr 2 maps to block 2, Instr 3 maps to block 3, and so on; thus, each loop iteration causes only 2 misses.
But in a fully associative cache with LRU replacement, Instr n of the previous iteration must evict Instr 1, since block 1 is now the least recently used; then in the next iteration we must bring Instr 1 back in and evict Instr 2, which is now the least recently used; next we must evict Instr 3, and so on. This is a chain effect: after the first iteration, every access is a cache miss. If the loop executes many times, the number of misses is far larger than in the direct-mapped cache, as the sketch below shows.
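A small Python simulation of this chain effect (the parameters n = 8 instructions and 100 iterations are my own):

    n = 8
    loop = list(range(n)) * 100                  # run the n-instruction loop 100 times

    def lru_misses(trace, capacity):
        cache, misses = [], 0                    # fully associative, LRU order
        for a in trace:
            if a in cache:
                cache.remove(a)
            else:
                misses += 1
                if len(cache) == capacity:
                    cache.pop(0)                 # evict the least recently used block
            cache.append(a)
        return misses

    def direct_mapped_misses(trace, capacity):
        cache, misses = {}, 0
        for a in trace:
            idx = a % capacity                   # the first and last instructions share a block
            if cache.get(idx) != a:
                misses += 1
                cache[idx] = a
        return misses

    print(lru_misses(loop, n - 1))               # 800: every access misses
    print(direct_mapped_misses(loop, n - 1))     # 206: about 2 misses per loop iteration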
Though memory capacity keeps increasing, I don't think virtual memory is an idea that should be abandoned. The first reason is that, even though capacity has grown, memory is still expensive, and it is impossible today to reach a physical capacity as large as the virtual address space. But this is not the most important reason: we still need the advantages brought by virtual memory. Without it, we could not share protected memory spaces, could not automatically manage the memory hierarchy, and programs could not be relocated when loaded for execution. All of these are important in a complex, high-performance system. So I believe the virtual memory idea will remain very useful even with a huge memory capacity.
The reason most computer systems don't take advantage of the extra security available with the gates and rings found in a CPU like the Intel Pentium probably lies in the following points: customers generally don't require or use such elaborate protection mechanisms, and they don't need such a high security level, especially on personal computers; moreover, the mechanism is a poor match for the simple paging protection of UNIX, which further limits its use.
But in cases where a high security level is the most important thing, this model of protection may be very useful. For example, with the increasing popularity of the Internet, where virtually any machine can become an information provider and hence almost anyone can reach a desktop computer on the web, the required security level is very high: people want to protect their machines from malicious behavior, and in such cases an elaborate protection model may be needed. A similar case is a computer or computer network system for military use, which may contain top-secret, confidential, and unclassified information together; here we need many levels of protection, since security is what we care about most. In such cases this model of protection may also be very useful.
So, for these specific reasons, part of the computer industry may gradually switch to this model of protection.