## Multiprocessors

- Are we reaching performance limits in uniprocessors?
- Performance enhancements are realized through improvements in:
  - Architecture
  - Technology
- Panel session at VTS'99: "The end of Moore's Law era?"
  - Three say yes (within 10 years), two say no
  - Jury is still out on this one
  - However, it is generally believed that the physics of the process, e.g. the size of an atom, will impose a hard limit
- With reference to Moore's law:
  - "All exponentials in nature eventually saturate."
  - What is the scaling factor of the x-axis?
  - Where are we today on the curve?

17-Apr-00

CMSC 611 (Advanced Computer Architecture), Spring 2000

## Why parallel machines?

- What about improvements in architecture?
  - Uniprocessor improvements reaching a point of diminishing returns!
  - Parallel machines appear to be a natural candidate as a successor to the uniprocessor
- Multiprocessors: cost effective way to improve performance
  - It is unlikely that architectural innovations can be sustained indefinitely
    - ⇒ Analogous to the physical laws that limit technology, except that the limit here is design complexity
  - Instead, connect multiple uniprocessors together
  - There has been steady progress on the major obstacle to widespread use of parallel machines: *software*
- Focus on the mainstream of multiprocessor design
  - Machines with small to medium numbers of processors (<100)
  - Viable architectures with more than 100 CPUs are difficult to predict


Chapter 8






- SIMD model was popular through the 80s
  - Examples include the MasPar and Connection Machine (Thinking Machines)
  - However, less popular today => too expensive to develop
- MIMD model has clearly emerged as the architecture of choice in recent years
  - MIMD offers flexibility
  - Can operate as a single-user machine providing high performance for one application
  - Can operate as multiprogrammed machines running many tasks simultaneously
- MIMDs can build on the cost/performance advantages of off-the-shelf microprocessors and systems




### Distributed memory

- Supports larger processor counts by distributing the memory and allowing multiple memories to work in parallel
- As processor bandwidth requirements increase, distributed memory outperforms centralized shared memory for ever smaller groups of processors









### Measuring communication latency

- Lower communication latency is better (of course)
- Communication latency = Sender overhead + Time of flight + Transport latency + Receiver overhead
  - Time of flight is fixed
  - Transport latency is determined by interconnection network
  - Sender and receiver overhead are determined by communication mechanism
- Complex mechanisms (i.e. for naming and protection) increase latency, particularly those that require the OS
- Latency affects performance either by
  - Causing the processor to wait
  - Tying up processor resources
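
The latency formula above can be sketched as a small calculation. The component values below are hypothetical and chosen only to illustrate how a heavier communication mechanism (for example, one that goes through the OS) raises the sender and receiver overheads while the network terms stay the same.

```python
def communication_latency(sender_overhead, time_of_flight,
                          transport_latency, receiver_overhead):
    """Sum the four latency components (all in the same time unit)."""
    return (sender_overhead + time_of_flight
            + transport_latency + receiver_overhead)


# Hypothetical component values in nanoseconds, for illustration only.
# Same network in both cases; only the communication mechanism differs.
user_level = communication_latency(50, 10, 100, 50)
os_mediated = communication_latency(500, 10, 100, 500)

print(user_level)    # 210
print(os_mediated)   # 1110
```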


#### Hiding communication latency

- How well can the mechanism hide latency by overlapping communication with computation or with other communication?
  - For example, a system that only allows access to a word at a time may have low latency
    - However, it may be unable to hide the latency because each word transferred is treated as a cache miss
  - Another machine may have a higher latency but allow the processor to do other things while waiting for data
- Examples of latency hiding techniques for shared memory will be given later
- Latency hiding is more difficult to measure than the previous two and is application dependent
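
A rough way to see why a higher-latency machine can still win is to model how much of the communication time overlaps with computation. The function name, the `hidden_fraction` parameter, and the numbers are all assumptions made for this sketch; real overlap depends on the application.

```python
def elapsed_time(compute, communicate, hidden_fraction):
    """Time for one phase when part of the communication overlaps computation.

    hidden_fraction is the share of communication time (0.0 to 1.0) that the
    mechanism lets the processor overlap with useful work.
    """
    # Communication can be hidden only behind computation that actually exists.
    hidden = min(communicate * hidden_fraction, compute)
    return compute + communicate - hidden


# Hypothetical machines; times are in arbitrary units.
low_latency_no_overlap = elapsed_time(compute=100, communicate=40,
                                      hidden_fraction=0.0)
high_latency_full_overlap = elapsed_time(compute=100, communicate=60,
                                         hidden_fraction=1.0)

print(low_latency_no_overlap)     # 140.0
print(high_latency_full_overlap)  # 100.0
```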



- These performance metrics are affected by
  - The size of the data items being communicated by the application ⇒ size affects the latency and bandwidth in a direct way
  - The effectiveness of the different latency hiding techniques
  - The regularity in the communication patterns
    - ⇒ These two affect the cost of naming and protection (communication overhead)
- An ideal mechanism would perform well with
  - Large and small data requests
  - Regular and irregular communication patterns




#### DSM vs. message passing

- Shared-memory advantages:
  - Compatibility with well-understood mechanisms in centralized SM
  - Ease of programming, particularly for systems in which communication patterns are complex or vary dynamically during execution
  - Low overhead for communication: hardware used to enforce protection
  - The ability to use hardware-controlled caching => reduces the frequency of remote communication
- Message-passing advantages:
  - Simpler hardware (especially with respect to building coherent caches)
  - Explicit communication forces programmers and compiler writers to pay attention to what is costly and what is not: is this an advantage?
- Shared-memory communication is more popular today
  - Centralized schemes still dominate
  - However, long-term trends favor distributed memory







- The use of large multilevel caches can substantially reduce memory bandwidth demands of a processor
  - This has made it possible for several CPUs to share the same memory through a shared bus
- Caching supports both private and shared data
  - For private data, once cached, its treatment is identical to that of a uniprocessor.
  - For shared data, the shared value may be replicated in many caches
- Replication has several advantages:
  - Reduced latency and memory bandwidth requirements
  - Reduced contention for data items that are read by multiple processors simultaneously
- However, it also introduces a problem: cache coherence




- Coherence defines what values can be returned by a read
- A memory system is coherent if:
  - Read after write works for a single processor
    - If CPU A writes *N* to location *X*, **all** future reads of location *X* will return *N* if no other processor writes location *X* after CPU A
  - Other processors' writes eventually propagate.
    - If CPU A writes value *N* to location *X*, CPU B will eventually be able to read value *N* from location *X*
    - Once it does so, it will continue to read value *N* until location *X* is written again
- This is our intuitive notion of a coherent view of memory



- Coherent caches provide both
  - Replication of shared data items (reduces latency and contention)
    - Provide multiple copies of data so that several processors can access a single piece of memory without serialization
  - Migration of data items (reduces latency)
    - Data items are moved from one processor to another as needed




## Bus snooping protocol: write invalidate

- Write invalidate is the most common protocol, both for snooping and for directory schemes
- The basic ideas behind this protocol:
  - Writes to a location invalidate other caches' copies of the block
  - Reads by other processors on invalidated data cause cache misses
  - If two processors write at the same time, one wins and obtains exclusive access
- Example assumes a write-back cache

| Processor<br>activity | Bus<br>activity | Contents of<br>CPU A's cache | Contents of<br>CPU B's cache | Contents of mem location <i>X</i> |
|-----------------------|-----------------|------------------------------|------------------------------|-----------------------------------|
| CPU A reads X         | Cache miss      | 0                            | -                            | 0                                 |
| CPU B reads X         | Cache miss      | 0                            | 0                            | 0                                 |
| CPU A writes 1        | Invalidate      | 1                            | -                            | 0                                 |
| CPU B reads X         | Cache miss      | 1                            | 1                            | 1                                 |
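
The trace in the table can be reproduced with a minimal simulation of a single memory location. This is a sketch under simplifying assumptions, not a full protocol: the function name and the `dirty_owner` variable are made up for illustration, and a write by a CPU that holds no copy is reported as an invalidate rather than as a separate write miss.

```python
def run_write_invalidate(trace, cpus=("A", "B"), initial=0):
    """Simulate write invalidate with write-back caches for one location X.

    trace is a list of (cpu, op, value) with op "read" or "write".
    Returns one row per step: (bus activity, cache contents..., memory).
    """
    memory = initial
    cache = {cpu: None for cpu in cpus}   # None means no valid copy
    dirty_owner = None                    # CPU holding a modified copy
    rows = []
    for cpu, op, value in trace:
        if op == "read":
            if cache[cpu] is not None:
                bus = "Cache hit"
            else:
                if dirty_owner is not None:
                    # The owner supplies the block and memory is updated.
                    memory = cache[dirty_owner]
                    dirty_owner = None
                cache[cpu] = memory
                bus = "Cache miss"
        else:
            if dirty_owner == cpu:
                bus = "None"              # already exclusive: no bus traffic
            else:
                for other in cpus:
                    if other != cpu:
                        cache[other] = None   # invalidate the other copies
                dirty_owner = cpu
                bus = "Invalidate"
            cache[cpu] = value            # write-back: memory is not updated
        rows.append((bus, *(cache[c] for c in cpus), memory))
    return rows


trace = [("A", "read", None), ("B", "read", None),
         ("A", "write", 1), ("B", "read", None)]
for row in run_write_invalidate(trace):
    print(row)
```

In the printed rows, `None` corresponds to the "-" entries in the table.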




## Bus snooping protocol: write update

- An alternative is to update all cached copies of the modified data item
  - This is called *write update* or *write broadcast*
- To reduce bandwidth requirements, this protocol keeps track of whether or not a word in the cache is shared
  - If not, no broadcast is necessary
- Example again assumes a write-back cache

| Processor<br>activity | Bus<br>activity | Contents of<br>CPU A's cache | Contents of<br>CPU B's cache | Contents of mem location <i>X</i> |
|-----------------------|-----------------|------------------------------|------------------------------|-----------------------------------|
| CPU A reads X         | Cache miss      | 0                            | -                            | 0                                 |
| CPU B reads X         | Cache miss      | 0                            | 0                            | 0                                 |
| CPU A writes 1        | Broadcast       | 1                            | 1                            | 1                                 |
| CPU B reads X         | Cache hit       | 1                            | 1                            | 1                                 |
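
A similar sketch reproduces the write update trace. The function name is hypothetical, and the model assumes that a broadcast also updates memory (as the table shows) and that a write to an unshared location needs no bus activity.

```python
def run_write_update(trace, cpus=("A", "B"), initial=0):
    """Simulate write update (write broadcast) for one location X.

    trace is a list of (cpu, op, value) with op "read" or "write".
    Returns one row per step: (bus activity, cache contents..., memory).
    """
    memory = initial
    cache = {cpu: None for cpu in cpus}   # None means no valid copy
    rows = []
    for cpu, op, value in trace:
        if op == "read":
            if cache[cpu] is not None:
                bus = "Cache hit"
            else:
                holders = [v for v in cache.values() if v is not None]
                if holders:
                    # All cached copies agree, so any holder has the
                    # latest value; memory is brought up to date.
                    memory = holders[0]
                cache[cpu] = memory
                bus = "Cache miss"
        else:
            shared = any(cache[c] is not None for c in cpus if c != cpu)
            cache[cpu] = value
            if shared:
                for c in cpus:
                    if cache[c] is not None:
                        cache[c] = value  # update every cached copy
                memory = value
                bus = "Broadcast"
            else:
                bus = "None"              # not shared: no broadcast needed
        rows.append((bus, *(cache[c] for c in cpus), memory))
    return rows


trace = [("A", "read", None), ("B", "read", None),
         ("A", "write", 1), ("B", "read", None)]
for row in run_write_update(trace):
    print(row)
```

Unlike the invalidate case, the final read by CPU B is a hit because its copy was updated in place.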


## Comparing bus snooping protocols

- Write invalidate is much more popular than write update
- Write update requires more system-wide notifications
  - Multiple writes to the same word with no intervening reads require multiple broadcasts
  - With multiword cache blocks, each word written requires a broadcast
  - Delay between write by one processor and read by another is lower
- Write invalidate uses fewer system-wide notifications
  - The first word written invalidates the entire block
  - Write invalidate works on blocks, while write broadcast works on individual words or bytes
  - Reading an invalidated block causes a miss (somewhat slower)
- Since bus and memory bandwidth are the resources most in demand in a bus-based multiprocessor, write invalidate performs better
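
The difference in notification counts can be illustrated with a deliberately simple count. The function name is made up, and the model assumes one CPU writes to one shared block with no access by other processors between the writes.

```python
def bus_notifications(writes, protocol):
    """Count system-wide notifications for writes by one CPU to one shared block.

    writes is a sequence of word offsets within the block. Assumption: no
    other processor touches the block between these writes.
    """
    if protocol == "update":
        # Every write to a shared word is broadcast, even repeated ones.
        return len(writes)
    if protocol == "invalidate":
        # The first write invalidates the whole block; later writes are local.
        return 1 if writes else 0
    raise ValueError(f"unknown protocol: {protocol}")


# Three writes to word 0, then one write each to words 1, 2 and 3.
writes = [0, 0, 0, 1, 2, 3]
print(bus_notifications(writes, "update"))      # 6
print(bus_notifications(writes, "invalidate"))  # 1
```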




### Writes in write-invalidate protocols

- Writes are an issue with cache coherence protocols in general
- The cache needs to know whether any other caches contain the block its processor is about to write
  - If there are none, then the write need not be placed on the bus, which reduces both the time to complete the write and the memory bandwidth consumed
- This can be tracked by adding an *extra state bit* (in addition to the valid and dirty bits) that indicates if the block is shared
  - If the bit is set (the block is shared), the cache generates an invalidation on the bus and marks the block as private
  - If another processor later requests the block, the miss is snooped and the "owner" sets the state bit to shared

### Optimizations for tag checking

- Note that every bus transaction checks the cache-address tags; this could potentially interfere with CPU cache accesses
- Reduce interference by:
  - Duplicating the tags: bus accesses proceed in parallel with CPU accesses
    - On misses, the processor arbitrates for and updates both sets of tags
    - A snoop also does this to perform an invalidate or to update the shared bit
    - However, a snoop may require fetching a block, thus stalling the processor
  - Employing a multilevel cache with inclusion
    - Snooping is directed to L2, where there are fewer processor accesses
    - If a snoop gets a hit, it must arbitrate for L1 to update state and possibly retrieve data, usually stalling the processor
    - Since multilevel caches are popular in multiprocessors (to reduce memory bandwidth demands), this solution is usually adopted


### Sample bus snooping protocol

- Implemented by incorporating a finite state controller in each node
  - The controller responds to requests from the processor and bus
- To simplify the controller, write hits and write misses to shared blocks are treated as write misses
  - This causes processors with copies to invalidate them

| Request    | Source    | Function                                                                 |
|------------|-----------|--------------------------------------------------------------------------|
| Read hit   | Processor | Read data in cache                                                       |
| Write hit  | Processor | Write data in cache                                                      |
| Read miss  | Bus       | Request data from cache or memory                                        |
| Write miss | Bus       | Request data from cache or memory,<br>and perform any needed invalidates |




