









## Hardware prefetch

- Prefetching is the act of getting data from memory before it is actually needed by the CPU
  - Typically, the cache fetches the next consecutive block along with a requested block, hoping to avoid a subsequent miss
  - Compulsory misses are reduced by retrieving the data before it is requested
  - Other misses may *increase*, because prefetched blocks can replace useful blocks in the cache
- Many caches hold prefetched blocks in a special buffer until they are actually needed
  - This buffer is faster than main memory but only has a limited capacity
- Prefetching also uses main memory bandwidth
  - Prefetching works well if the data is actually used
  - However, it can adversely affect performance if the data is rarely used and the accesses interfere with 'demand misses'
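
The trade-off above can be illustrated with a tiny simulation. This is only a sketch under stated assumptions: a direct-mapped cache with a hypothetical 64 block frames, a prefetcher that fetches block b+1 on every demand miss for block b, and prefetched blocks placed directly in the cache rather than in a separate buffer. The names `NUM_LINES` and `count_misses` are made up for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_LINES 64   /* hypothetical: direct-mapped cache with 64 block frames */

/* Count demand misses for a sequence of block addresses.
 * If prefetch_next is true, every demand miss on block b also brings
 * block b+1 into the cache (one-block-lookahead prefetch). */
size_t count_misses(const unsigned *blocks, size_t n, bool prefetch_next)
{
    unsigned tag[NUM_LINES];
    bool valid[NUM_LINES] = { false };
    size_t misses = 0;

    assert(blocks != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        unsigned b = blocks[i];
        unsigned idx = b % NUM_LINES;

        if (valid[idx] && tag[idx] == b)
            continue;                   /* hit: nothing to do */

        misses++;                       /* demand miss: fetch block b */
        valid[idx] = true;
        tag[idx] = b;

        if (prefetch_next) {            /* prefetch b+1; may evict a useful block */
            unsigned p = b + 1;
            valid[p % NUM_LINES] = true;
            tag[p % NUM_LINES] = p;
        }
    }
    return misses;
}
```

With a purely sequential stream of blocks, every other access becomes a hit, so the prefetch pays off. With the pattern 0, 63, 0, 63, the prefetch of block 64 evicts block 0 from frame 0, so the prefetching cache takes one more miss than the plain one.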




Chapter 5









## Compiler optimization: loop fusion

- Many programs have separate loops that operate on the same data
- Combining these loops allows a program to take advantage of **temporal locality** by grouping operations on the same (cached) data together
- Caching may work even better because of sequential access between elements
- Caching can hold results from previous iterations of the loop...

Before fusion:

```c
for (j=0; j<100; j++) {
  x[j] = x[j] + y[j];
}
for (j=0; j<100; j++) {
  y[j] = y[j] + x[j-1];
}
```

After fusion:

```c
for (j=0; j<100; j++) {
  x[j] = x[j] + y[j];
  y[j] = y[j] + x[j-1];
}
```

**UMBC** CMSC 611 (Advanced Computer Architecture), Spring 2000 (9-Apr-00), Chapter 5
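
Fusion is only legal if it preserves the results of the separate loops. The sketch below checks this for the two loops on this slide. It makes a few assumptions that are not in the slides: the arrays hold `int` values so results can be compared exactly, the function names `unfused` and `fused` are made up, and the loops start at `j = 1` so that `x[j-1]` never reads before the start of the array.

```c
#include <assert.h>

#define N 100

/* Two separate passes: each loop streams over the arrays once. */
void unfused(int *x, int *y)
{
    assert(x && y);
    for (int j = 1; j < N; j++)
        x[j] = x[j] + y[j];
    for (int j = 1; j < N; j++)
        y[j] = y[j] + x[j-1];
}

/* One fused pass: x[j], y[j], and x[j-1] are reused while still cached. */
void fused(int *x, int *y)
{
    assert(x && y);
    for (int j = 1; j < N; j++) {
        x[j] = x[j] + y[j];
        y[j] = y[j] + x[j-1];
    }
}
```

The fused loop reads `y[j]` before updating it and reads `x[j-1]` after it was updated in the previous iteration, which is the same order of values the two separate loops see.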





## Giving read misses priority

- If a system has a write buffer, service read misses ahead of buffered writes
- Problem: reads may request a value about to be written
- Solution 1: stall reads until the write buffer is empty
  - The write buffer in write-through is likely to have blocks queued up
  - Read miss penalty increases considerably
- Solution 2: check the write buffer for conflicts
  - In cases like this, the write buffer acts as a victim cache

```
SW 0(R3),R4
LW R11,4096(R3)
LW R12,0(R3)
```

If this is a direct-mapped 4KB cache, will R12 get the value from R4?




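## Early restart & critical word first

- Early restart
  - As soon as the requested word arrives, send it to the CPU and resume execution
  - CPU doesn't wait for the rest of the block!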
- Critical word first
  - Don't start the fetch of a block with the first word
  - Instead, fetch the requested word first and then fetch the rest afterwards
- Early restart & critical word first reduce the miss penalty
  - ⇒ CPU can continue execution while most of the block is still being fetched
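
As a rough illustration of the two techniques, the sketch below computes the order in which the words of a block would be fetched, and how many words must arrive before the CPU can restart. It assumes a wrap-around fetch order (requested word first, then the following words, wrapping to the start of the block), which is one common choice but is not stated in the slides. Both function names are hypothetical.

```c
#include <assert.h>

/* Fill order[] with word offsets in fetch order, starting at the critical
 * (requested) word and wrapping around to the beginning of the block. */
void critical_word_first_order(unsigned critical, unsigned words_per_block,
                               unsigned *order)
{
    assert(order != 0);
    assert(words_per_block > 0 && critical < words_per_block);
    for (unsigned i = 0; i < words_per_block; i++)
        order[i] = (critical + i) % words_per_block;
}

/* Number of words that must arrive before the CPU can continue.
 * Early restart alone fetches in normal order, so the CPU waits for
 * words 0..critical. With critical word first, it waits for one word. */
unsigned words_before_restart(unsigned critical, int critical_first)
{
    return critical_first ? 1u : critical + 1u;
}
```

For an 8-word block with a miss on word 5, the fetch order is 5, 6, 7, 0, 1, 2, 3, 4. The CPU restarts after 1 word instead of the 6 words needed with early restart alone.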








## Desirable characteristics for an L2 cache

- Larger than the L1 cache
  - A miss in L1 is unlikely to be a hit in L2 unless L2 is much larger
  - The local hit rate for L2 depends on the size ratio between L1 and L2!
- Higher associativity
  - The main reason for low associativity is to keep a small cache fast
  - The L2 cache need be neither small nor as fast, and will benefit from the higher hit rate that more blocks per set provide
- Larger block size
  - This reduces compulsory misses, which must be serviced from main memory
  - Since the L2 cache is large, larger blocks add few conflict misses (unlike in a smaller cache)
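
The distinction between local and global rates matters when evaluating these choices. The local L2 miss rate counts only the accesses that reach L2 (that is, L1 misses), while the global rate is measured against all CPU accesses. A small sketch of the standard two-level formulas follows. The function and parameter names are made up for this example, and all times are in the same unit (for instance, clock cycles).

```c
#include <assert.h>

/* Fraction of ALL CPU accesses that miss in both L1 and L2. */
double l2_global_miss_rate(double l1_miss_rate, double l2_local_miss_rate)
{
    assert(l1_miss_rate >= 0.0 && l1_miss_rate <= 1.0);
    assert(l2_local_miss_rate >= 0.0 && l2_local_miss_rate <= 1.0);
    return l1_miss_rate * l2_local_miss_rate;
}

/* Average memory access time with two cache levels:
 * L1 hit time + L1 miss rate * (L2 hit time + L2 local miss rate * memory penalty) */
double amat_two_level(double l1_hit_time, double l1_miss_rate,
                      double l2_hit_time, double l2_local_miss_rate,
                      double mem_penalty)
{
    assert(l1_miss_rate >= 0.0 && l1_miss_rate <= 1.0);
    assert(l2_local_miss_rate >= 0.0 && l2_local_miss_rate <= 1.0);
    return l1_hit_time
         + l1_miss_rate * (l2_hit_time + l2_local_miss_rate * mem_penalty);
}
```

For example, with illustrative numbers of a 4% L1 miss rate and a 50% local L2 miss rate, only 2% of all accesses go to main memory, even though the L2 local rate looks poor in isolation.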

