Solution to Homework 5

1. Problem 5.1 from Textbook

From the question we know that Cache A has 128 entries and is 2-way set associative, while Cache B has 256 entries and is direct-mapped.

a) The following is a program that makes Machine A run much faster than Machine B:

        b1: beqz r0, b2    ; branch if r0=0
        b2: beqz r0, b1    ; branch if r0=0

    here we make two assumptions:

        1) the value in r0 is 0
        2) the locations of b1 and b2 are both mapped to the same set in both caches

        The reason is that on Machine A the code causes cache misses only the first time the two instructions are loaded into the cache; after that, since the cache is 2-way set associative, both instructions fit in the same set and every access hits. On Machine B, however, every access misses, because the direct-mapped cache can hold only one block per set, so b1 and b2 keep evicting each other.

b) The following is a program that makes Machine B run much faster than Machine A:

        b1: sw     0(r1), r0      ; store r0 to memory
        b2: beqz   r0, b1          ; branch if r0=0

    here we make two assumptions:

        1) the value in r0 is 0
        2) locations of b1, b2 and 0(r1) are mapped to different sets within the caches and are all initially resident

        The reason is that on Machine A, each time the sw executes, the stored data is written through to memory, which takes additional time. On Machine B, the sw only updates the cache block, since the cache is write-back: the block is not written to main memory until it is replaced.

c) For Machine A, if we ignore the overhead, each instruction access hits the cache, so the total memory access time is n*ha, where n is the number of instructions executed and ha is the cache hit time for Machine A.
    For Machine B, every instruction access causes a cache miss, so the total memory access time is n*10hb, where 10hb is the miss time on Cache B. If we assume that ha = hb, then in this situation Machine A runs the code 10 times faster than Machine B.

d) Machine B needs a read access (which hits the cache) to fetch each instruction and a write access (which also hits the cache) to write the stored data into the cache, for a total time of 1.5*n*hb. Machine A also fetches each instruction with a read access that hits the cache, but it must write the stored data all the way to main memory because of the write-through policy, which takes five times the hit time. Thus the total time is n*(ha + 0.5*5ha) = 3.5*n*ha. If we assume that ha = hb, then Machine B runs 3.5/1.5 = 2.33 times faster than Machine A.
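As a quick arithmetic check (not part of the textbook solution), the ratios in c) and d) can be reproduced with a short Python sketch. It assumes ha = hb = 1, the 10x miss time from c), and the 5x write-through time from d):

        # Quick check of the ratios in parts c) and d); assumes ha = hb = 1.
        n = 1_000_000           # number of instructions executed (cancels out of the ratios)
        ha = hb = 1.0           # cache hit time, assumed equal on both machines

        # Part c): Machine A always hits, Machine B always misses (miss time = 10 * hit time)
        time_a_c = n * ha
        time_b_c = n * 10 * hb
        print(time_b_c / time_a_c)          # -> 10.0: Machine A is 10x faster

        # Part d): half of the accesses in the loop of part b) are stores
        time_b_d = 1.5 * n * hb             # instruction fetches + stores, all cache hits
        time_a_d = n * (ha + 0.5 * 5 * ha)  # fetches hit, stores written through at 5x hit time
        print(time_a_d / time_b_d)          # -> ~2.33: Machine B is about 2.33x faster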



2. From the problem we know that the L1 I-Cache miss rate is 1-96% = 4%, the L1 D-Cache miss rate is 1-90% = 10%, and Hit time_L1 can be ignored (treated as zero). Also, one bus cycle is 10 ns and one CPU cycle is 2 ns.

a) We have:  Global Miss Rate of L2 Cache = Miss Rate_L1 * Miss Rate_L2
     Global Miss Rate of L2 Cache = 1 - 99.8% = 0.2%
     Miss Rate_L1 = [1*4% + 10%*(15% + 8%)] / [1 + 15% + 8%] = 5.122%   (one instruction fetch per instruction, plus 15% loads and 8% stores, each stream with its own miss rate)
     Thus Miss Rate_L2 = 0.2% / 5.122% = 3.905%
     Thus the local hit rate of the L2 cache is 1 - 3.905% = 96.095%
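A short Python check of this arithmetic (a sanity check only, not part of the required solution):

        # Sanity check for part a): L1 miss rate and local L2 miss rate
        accesses_per_instr = 1 + 0.15 + 0.08     # 1 instruction fetch + 15% loads + 8% stores
        miss_rate_l1 = (1 * 0.04 + (0.15 + 0.08) * 0.10) / accesses_per_instr
        global_miss_rate_l2 = 1 - 0.998          # 0.2% of all references miss in L2
        local_miss_rate_l2 = global_miss_rate_l2 / miss_rate_l1

        print(miss_rate_l1)            # ~0.05122 -> 5.122%
        print(local_miss_rate_l2)      # ~0.03905 -> 3.905%
        print(1 - local_miss_rate_l2)  # ~0.96095 -> local L2 hit rate of 96.095%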

b) CPI exec = 0.8
    Avg. memory access time per instruction = [ Hit time_L1 + Miss Rate_L1*(Hit time_L2 + Miss Rate_L2*MissPenalty_L2) ] * 1.23   (1.23 memory accesses per instruction)
    We have:
        Hit time_L1=0
        Miss Rate_L1=5.122%
        Hit time_L2= 10 ns
        Miss Rate_L2=3.905%
        Also, we know that when there is an L2 cache miss, if the replaced block is not dirty, the penalty is 1 (address) + 4 (8 words / 2 words per transfer) + 5 (memory latency) = 10 bus cycles, i.e., 100 ns. If the replaced block is dirty, the penalty is doubled, i.e., 200 ns. Thus MissPenalty_L2 = 100*50% + 200*50% = 150 ns

        So, we have Avg. memory access time per instruction = [0 + 5.122%*(10 + 3.905%*150) ns] * 1.23 = 0.999 ns, thus CPI mem = 0.999/2 = 0.4995 CPU cycles

        Thus, the total CPI = CPI exec + CPI mem = 0.8 + 0.4995 = 1.2995 cycles

        Thus, the utilization of the main memory is 0.4995/1.2995 = 38.44%
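The numbers in b), and in c) below, can be reproduced with a small Python helper; the function name and structure are mine, only the input values come from the problem:

        # Computes CPI mem, total CPI, and main-memory utilization for a given L2 miss penalty.
        def cpi_and_utilization(miss_penalty_l2_ns, cpi_exec=0.8, cpu_cycle_ns=2.0):
            hit_time_l1 = 0.0           # given: L1 hit time is ignored
            miss_rate_l1 = 0.05122      # from part a)
            hit_time_l2 = 10.0          # one bus cycle, in ns
            miss_rate_l2 = 0.03905      # local L2 miss rate from part a)
            accesses_per_instr = 1.23   # 1 fetch + 15% loads + 8% stores

            stall_ns = accesses_per_instr * (hit_time_l1 + miss_rate_l1 *
                       (hit_time_l2 + miss_rate_l2 * miss_penalty_l2_ns))
            cpi_mem = stall_ns / cpu_cycle_ns
            cpi_total = cpi_exec + cpi_mem
            return cpi_mem, cpi_total, cpi_mem / cpi_total

        print(cpi_and_utilization(150))   # 64-bit bus: (~0.4995, ~1.2995, ~0.3844)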

c) If the main memory bus were doubled to 128 bits (4 words) wide, then we need to re-compute MissPenalty_L2. Now, when there is an L2 cache miss, if the replaced block is not dirty, the penalty is 1 (address) + 2 (8 words / 4 words per transfer) + 5 (memory latency) = 8 bus cycles, i.e., 80 ns. If the replaced block is dirty, the penalty is doubled, i.e., 160 ns. Thus MissPenalty_L2 = 80*50% + 160*50% = 120 ns

        So, we have Avg. memory access time per instruction = [0 + 5.122%*(10 + 3.905%*120) ns] * 1.23 = 0.925 ns, thus CPI mem = 0.925/2 = 0.463 CPU cycles

        Thus, the total CPI = CPI exec + CPI mem = 0.8 + 0.463 = 1.263 cycles

        Thus, the utilization of the main memory is 0.463/1.263 = 36.67%
        Also, in this case, the speedup over the 64-bit bus is 1.2995/1.263 = 1.029
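Calling the same helper with the 120 ns penalty reproduces the 128-bit-bus numbers and the speedup:

        cpi_mem_64, cpi_64, util_64 = cpi_and_utilization(150)     # 64-bit bus (part b)
        cpi_mem_128, cpi_128, util_128 = cpi_and_utilization(120)  # 128-bit bus (part c)
        print(cpi_mem_128, cpi_128, util_128)   # ~0.463, ~1.263, ~0.367
        print(cpi_64 / cpi_128)                 # speedup ~1.029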



3. Problem 5.5 from the text

To estimate the performance, we have the following formula:

CPU time = IC * Clock Cycle Time * CPI. Since here IC and Clock Cycle Time are the same, we can just use CPI as the performance measure.

Also, CPI = CPI exec (naive CPI without memory accesses) + CPI mem ( CPI due to memory stall cycles)

Suppose f-load, f-store, f-other are the frequencies for load, store and other instructions, then we have:

CPI exec = f-load * CPI load + f-store * CPI store + f-other * CPI other

CPI mem = MemoryStallCycles/Instruction
        = InstructionFetchStallCycles/Instruction + DataAccessStallCycles/Instruction
        = (InstructionFetches/Instruction) * InstructionCacheMissRate * I_Misspenalty +
          (DataAccesses/Instruction) * DataCacheMissRate * D_Misspenalty
        = 1 * 0.5% * I_Misspenalty + (f-load + f-store) * 1% * D_Misspenalty

We know that I_Misspenalty = 50 cycles, since the instruction cache only has read misses and never has write-back traffic.
D_Misspenalty depends on the cache's write policy: for write-through, D_Misspenalty = 50 cycles, but for write-back, D_Misspenalty = 50 + (50% * 50) = 75 cycles.

Thus, for write-through: CPI mem = 1 * 0.5% * 50 + (f-load + f-store) * 1% * 50 = 0.25 + 0.5 * (f-load + f-store)
For write-back:          CPI mem = 0.25 + (f-load + f-store) * 1% * 75 = 0.25 + 0.75 * (f-load + f-store)
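As a sanity check on the table entries below (the helper names are mine, not from the textbook), the two formulas can be evaluated for any instruction mix:

        # CPI exec and CPI mem for one benchmark's instruction mix.
        # cpi_store is 2 in part a) and 1 in part b).
        def cpi_exec(f_load, f_store, f_other, cpi_store):
            return f_load * 1 + f_store * cpi_store + f_other * 1

        def cpi_mem(f_load, f_store, write_back):
            d_miss_penalty = 75 if write_back else 50    # 50 + 50%*50 cycles when write-back
            return 1 * 0.005 * 50 + (f_load + f_store) * 0.01 * d_miss_penalty

        # Example: Compress in part a) (f-load = 19.8%, f-store = 5.6%, f-other = 74.6%)
        print(cpi_exec(0.198, 0.056, 0.746, 2))               # ~1.056
        print(cpi_mem(0.198, 0.056, write_back=False))        # ~0.377 (write through)
        print(cpi_exec(0.198, 0.056, 0.746, 2) +
              cpi_mem(0.198, 0.056, write_back=True))         # ~1.497 (write back, total CPI)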

a) In this situation, CPI load = 1, CPI store = 2, CPI other = 1, thus CPI exec = f-load + 2*f-store + f-other.
    Now, using the statistics in Figure 2.26 on page 105 and the formulas above, the following table summarizes the results (CPI) for both the write-through and write-back caches.
 
 
Program      f-load    f-store   f-other     Write Through        Write Back           CPI exec
                                             CPI mem    CPI       CPI mem    CPI
Compress     19.8%     5.6%      74.6%       0.377      1.433     0.441      1.497     1.056
Eqntott      30.6%     0.6%      68.8%       0.406      1.412     0.484      1.490     1.006
Espresso     20.9%     5.1%      74.0%       0.380      1.431     0.445      1.496     1.051
Gcc(cc1)     22.8%     14.3%     62.9%       0.436      1.579     0.528      1.671     1.143
li           31.3%     16.7%     52.0%       0.490      1.657     0.610      1.777     1.167
Average      26.0%     9.0%      65.0%       0.425      1.515     0.513      1.603     1.090

b) In this case, the formula for CPI exec is CPI exec = f-load + f-store + f-other = 1.000, and everything else is the same.
 
 
Program      f-load    f-store   f-other     Write Through        Write Back           CPI exec
                                             CPI mem    CPI       CPI mem    CPI
Compress     19.8%     5.6%      74.6%       0.377      1.377     0.441      1.441     1.000
Eqntott      30.6%     0.6%      68.8%       0.406      1.406     0.484      1.484     1.000
Espresso     20.9%     5.1%      74.0%       0.380      1.380     0.445      1.445     1.000
Gcc(cc1)     22.8%     14.3%     62.9%       0.436      1.436     0.528      1.528     1.000
li           31.3%     16.7%     52.0%       0.490      1.490     0.610      1.610     1.000
Average      26.0%     9.0%      65.0%       0.425      1.425     0.513      1.513     1.000
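Running the same helpers with the average instruction mix reproduces the Average rows of both tables:

        # Average mix from Figure 2.26: f-load = 26%, f-store = 9%, f-other = 65%
        for cpi_store in (2, 1):             # part a) uses CPI store = 2, part b) uses 1
            for wb in (False, True):         # write through first, then write back
                total = cpi_exec(0.26, 0.09, 0.65, cpi_store) + cpi_mem(0.26, 0.09, wb)
                print(cpi_store, wb, total)  # 1.515, 1.6025, 1.425, 1.5125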

From a) and b) we can see that the write-through cache is always faster than the write-back cache for the cases in this problem.



4. Using a TLB properly leads to fast address translation and thus improves system performance.

In this problem, we can compute the penalty for one TLB miss: 0.8*12 + 4*50 = 209.6 ns,
since one TLB miss leads to executing 12 instructions with 4 data cache misses (a penalty of 50 ns per miss), the base CPI is 0.8, and the clock rate is 1 GHz (so one cycle is 1 ns).

The penalty imposed by one regular cache miss is 50 ns

So, the TLB miss penalty per instruction = 209.6 ns * 0.01% = 0.02096 ns (with a TLB miss rate of 0.01%).
Meanwhile, the regular cache miss penalty per instruction = 1%*50 + (26% + 9%)*4%*50 = 1.2 ns, if we assume from Figure 2.26 that the load frequency is 26% and the store frequency is 9%.
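A short Python check of these per-instruction penalties (assuming, as above, a 0.01% TLB miss rate and 1% instruction-cache and 4% data-cache miss rates):

        # Per-instruction miss penalties used in the comparison above
        tlb_miss_penalty_ns = 0.8 * 12 + 4 * 50    # 12 instructions at CPI 0.8, 1 ns/cycle, plus 4 D-cache misses
        tlb_per_instr = tlb_miss_penalty_ns * 0.0001                # TLB miss rate of 0.01%
        cache_per_instr = 0.01 * 50 + (0.26 + 0.09) * 0.04 * 50     # I-cache + D-cache misses
        print(tlb_miss_penalty_ns, tlb_per_instr, cache_per_instr)  # 209.6, ~0.021, 1.2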

Thus, although the penalty for one TLB miss is huge compared to one regular cache miss, the TLB has a very low miss rate, so its influence on performance is small compared to that of regular cache misses.



5. For a small cache system, a cache using direct mapping could consistently outperform one using fully associative mapping with LRU replacement. A direct-mapped cache has a simple address comparison and a simple, fast hardware implementation, so its overhead is low. Its main drawbacks, an inflexible, fixed mapping that can lower cache utilization and hit rate, matter most in a large cache; in a very small cache their impact is much reduced. A fully associative cache with LRU, on the other hand, must compare the tag against every entry and needs extra state and control logic to maintain LRU order, so it is slower and has a larger overhead. In a large cache this overhead may not be very noticeable, but in a small cache it can hurt performance badly. Thus, in a very small cache system the simplest scheme may work best, since any significant overhead causes performance to drop dramatically.

One scenario is the following: suppose the cache has 4 blocks in total, instruction i is in block 1 and is the least recently used, and the other three blocks are also occupied. In the fully associative cache, when another instruction j comes in, it replaces i; if i is referenced again right after j, that reference misses because i has just been replaced. In the direct-mapped cache, if j maps to block 4, then block 4 is replaced instead, and the next reference to i after j still hits.

Another scenario in which the fully associative cache with LRU experiences more cache misses than the direct-mapped cache is the following. Suppose we have this code:

 LOOP:    Instr 1
          Instr 2
          Instr 3
          .........
          Instr n    ; here jump back to LOOP

Also suppose the cache has n-1 blocks in total. In a direct-mapped cache, Instr 1 and Instr n map to block 1, Instr 2 maps to block 2, Instr 3 to block 3, and so on; thus every loop iteration causes only 2 misses (Instr 1 and Instr n keep evicting each other).

But if we use a fully associative cache with LRU, Instr n of the previous iteration must replace Instr 1, since block 1 is now the least recently used; then in the next iteration Instr 1 must be brought back in and replaces Instr 2, which is now the least recently used, and then Instr 2 replaces Instr 3, and so on. This chain effect means that after the first iteration every access is a miss. If the loop executes many times, the number of misses is huge compared to the direct-mapped cache.
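A small Python simulation of this loop pattern (a sketch under the stated assumptions: n distinct instruction blocks referenced cyclically, a cache of n-1 blocks, and no other traffic) shows the effect:

        # Compares direct-mapped vs fully associative LRU on the loop scenario above.
        from collections import OrderedDict

        def count_misses(n_blocks, cache_blocks, iterations, fully_associative):
            misses = 0
            if not fully_associative:             # direct-mapped: block maps to addr mod cache_blocks
                cache = [None] * cache_blocks
                for _ in range(iterations):
                    for addr in range(n_blocks):
                        idx = addr % cache_blocks
                        if cache[idx] != addr:
                            misses += 1
                            cache[idx] = addr
            else:                                 # fully associative with LRU replacement
                cache = OrderedDict()
                for _ in range(iterations):
                    for addr in range(n_blocks):
                        if addr in cache:
                            cache.move_to_end(addr)         # mark as most recently used
                        else:
                            misses += 1
                            if len(cache) == cache_blocks:
                                cache.popitem(last=False)   # evict the LRU block
                            cache[addr] = True
            return misses

        print(count_misses(8, 7, 10, fully_associative=False))  # 26: 8 cold misses + 2 per later iteration
        print(count_misses(8, 7, 10, fully_associative=True))   # 80: every access misses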



6. Problem 5.20 from the textbook
 

     Though the capacity of memory is increasing nowadays, I don't think virtual memory is an idea whose time has passed. The first reason is that even though memory capacity has increased, memory is still expensive, and it is impossible today to reach a physical capacity as huge as a virtual address space. But that is not the most important reason: we still need the advantages brought by virtual memory. Without virtual memory we cannot share protected memory space, cannot automatically manage the memory hierarchy, and cannot relocate programs at load time for execution; all of these are important in a complex, high-performance system. So I believe the virtual memory idea will still be very useful even with a huge memory capacity.



7. Problem 5.21 from the textbook

The reason that most computer systems don't take advantage of the extra security available with the gates and rings found in a CPU like the Intel Pentium may lie in the following points: in general, customers don't require or use such elaborate protection mechanisms and don't need such a high security level, especially on personal desktop computers; also, the ring model is a mismatch for the simple paging protection of UNIX, which further limits its use.

But in cases where a high security level is the most important thing, this model of protection may be very useful. For example, with the increasing popularity of the Internet, where virtually any machine can become an information provider and hence almost anyone can reach a desktop computer over the web, the required security level is very high: users want to protect their machines from malicious behavior, and in such a case this kind of elaborate protection model may be needed. A similar case is computer systems or networks for military use, which may contain top secret, confidential, and unclassified information; here we need many levels of protection, since security is the thing we care about most. In such a case, this model of protection may also be very useful.

So, for these specific reasons, part of the computer industry may gradually switch to this model of protection.