

- SISD (Single instruction & data stream): uniprocessor
- SIMD (Single instruction stream, multiple data streams)
  - Same instruction is executed by many CPUs on different data streams
  - Each processor has its own data memory
  - Only a single instruction memory and control processor which fetches and dispatches instructions
- MISD (Multiple instruction streams, single data stream)
  - No commercial versions built, but perhaps systolic processors?
- MIMD (Multiple instruction streams, multiple data streams)
  - Each CPU fetches its own instructions and operates on its own data
  - Often built using off-the-shelf microprocessors

- Machines)
- However, less popular today => too expensive to develop
- MIMD model has clearly emerged as the architecture of choice in recent years
  - MIMD offers flexibility
  - Can operate as a single-user machine providing high performance for one application
  - Can operate as multiprogrammed machines running many tasks simultaneously
- MIMDs can build on the cost/performance advantages of off-the-shelf microprocessors & systems

17-Apr-00



# Distributed vs. centralized shared memory

- Distributing the memory among the nodes has two major advantages
  - It's a cost-effective way to scale the memory bandwidth if most accesses are to local memory in the node
  - It reduces the latency for accesses to local memory, due to less contention
- Distributed memory has some disadvantages as well
  - Communicating data between processors becomes more complex
  - Interprocessor communication has higher latency
- Key characteristics that distinguish among distributed memory machines are
  - How communication is performed.
  - The architecture of the distributed memory

# Distributed memory

- Supports larger processor counts by distributing the memory and allowing multiple memories to work in parallel
- As processor bandwidth requirements increase, distributed memory beats out centralized shared memory even for smaller groups of processors



# Distributed memory architecture models

- Distributed shared memory (DSM): physically separate memories are addressed as one logically shared address space
  - The address space is shared: all processors see the same address space
  - These machines are referred to as *NUMA* (*Non-Uniform Memory Access*) in contrast to the centralized UMA machines
- Multicomputer architecture
  - Multiple private address spaces that are logically disjoint and cannot be addressed by a remote processor
  - An associated communication mechanism is used for exchanging data
- For DSM, shared memory can be used to communicate data via load and store operations
- For a multicomputer, communication is done by either synchronous (RPC) or asynchronous message passing
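
The two communication styles can be contrasted with a minimal sketch. It uses Python threads as stand-ins for processors, which is an assumption made only for illustration: a dictionary plays the role of the shared address space and a `queue.Queue` plays the role of the message-passing mechanism. The function names are hypothetical.

```python
import queue
import threading


def shared_memory_demo():
    """Communicate through a shared location using a plain store and load."""
    shared = {"x": 0}              # stands in for one shared address space
    ready = threading.Event()      # synchronization is separate from the data

    def producer():
        shared["x"] = 42           # "store" to the shared location
        ready.set()

    worker = threading.Thread(target=producer)
    worker.start()
    ready.wait()
    value = shared["x"]            # "load" from the same location
    worker.join()
    return value


def message_passing_demo():
    """Communicate by explicitly sending a message between private spaces."""
    channel = queue.Queue()        # stands in for the communication mechanism

    def producer():
        private_x = 42             # visible only inside the producer
        channel.put(private_x)     # explicit send

    worker = threading.Thread(target=producer)
    worker.start()
    value = channel.get()          # explicit, blocking receive
    worker.join()
    return value
```

In the first function the data moves implicitly through a named location; in the second nothing is shared and the transfer is an explicit send/receive pair.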


Chapter 8




### Hiding communication latency

- How well can the mechanism hide latency by overlapping communication with computation or with other communication?
  - For example, a system that only allows access to a word at a time may have low latency
    - However, it may be unable to hide the latency because each word transferred is treated as a cache miss
  - Another machine may have a higher latency but allow the processor to do other things while waiting for data
- Examples of latency hiding techniques for shared memory will be given later
- Latency hiding is more difficult to measure than the previous two and is application dependent
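
A rough timing model makes the trade-off above concrete. This is a sketch under simplifying assumptions: the blocking machine stalls for every word, the overlapping machine hides the whole transfer behind computation, and all cycle counts are hypothetical.

```python
def blocking_time(compute_cycles, n_words, word_latency):
    """Each word is a separate miss, so the processor stalls for every one."""
    return compute_cycles + n_words * word_latency


def overlapped_time(compute_cycles, transfer_latency):
    """The transfer runs in the background while the processor computes."""
    return max(compute_cycles, transfer_latency)


# Hypothetical numbers: 4000 cycles of computation and 100 words to move.
low_latency_machine = blocking_time(4000, n_words=100, word_latency=20)
high_latency_machine = overlapped_time(4000, transfer_latency=3000)

print(low_latency_machine)    # 6000
print(high_latency_machine)   # 4000
```

With these numbers the machine with the higher raw latency (3000 cycles versus 2000 in total) still finishes first, because none of its communication time is exposed.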

## Performance metrics for communication

- These performance metrics are affected by
  - The size of the data items being communicated by the application
    - ⇒ Size affects the latency and bandwidth in a direct way
  - The effectiveness of the different latency hiding techniques
  - The regularity in the communication patterns
    - ⇒ These two affect the cost of naming and protection (communication overhead)
- An ideal mechanism would perform well with
  - Large and small data requests
  - Regular and irregular communication patterns


### Cache coherence

- With multiple caches, one CPU can modify memory at locations that other CPUs have cached
- For example:
  - CPU A reads location x, getting the value N
  - Later, CPU B reads the same location, getting the value N
  - Next, CPU A writes location x with the value N - 1
  - At this point, any reads from CPU B will get the value N, while reads from CPU A will get the value N - 1
- This problem occurs both with write-through caches and (more seriously) with write-back caches
- Cache coherence (an informal definition): a memory system is coherent if any read of a data item returns the most recently written value of that data item

### Cache coherence definitions

- Coherence defines what values can be returned by a read
- A memory system is coherent if:
  - Read after write works for a single processor
    - If CPU A writes N to location X, all future reads of location X will return N if no other processor writes location X after CPU A
  - Other processors' writes eventually propagate
    - If CPU A writes value *N* to location *X*, CPU B will eventually be able to read value N from location X
    - Once it does so, it will continue to read value N until location X is written again
    - This is our intuitive notion of a coherent view of memory

### Cache coherence & consistency

- Coherence: writes to a single location are serialized
  - If CPUs A and B both write to location X, all processors see the same order of the writes
  - This does not mean that all reads must return the same value
    - If value *N1* is written "first" to location *X*, followed closely by reads of X and a write of X with value N2, some reads may return N1 and some N2
  - However, a processor that reads N2 will return N2 for all future reads
- Consistency
  - This indicates when a modification to memory is seen by other processors (i.e. will be returned by a read)
  - Clearly, this *can't* be "instantaneous" since it may be that the new value has not even left the processor when a read occurs
  - Issue: when is a write visible to other processors?

### Cache consistency

- Consistency issue: when must a written value be seen by a reader?
  - This is defined by a memory consistency model
  - For now, assume that a write is not complete until all processors have "seen" the effect of the write
  - Also, assume that a processor may not reorder memory accesses to move reads before an outstanding write
    - Reads can be reordered, but reads and writes can not be interchanged
- Coherent caches provide both
  - Replication of shared data items (reduces latency and contention)
    - Provide multiple copies of data so that several processors can access a single piece of memory without serialization
  - Migration of data items (reduces latency)
    - Data items are moved from one processor to another as needed
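
The stale-read example (CPU A and CPU B both caching location x, then A writing N - 1) can be reproduced with a small sketch. It assumes write-through caches with no coherence mechanism at all; the class name `PrivateCache` and the value chosen for N are hypothetical.

```python
class PrivateCache:
    """A write-through cache with no coherence mechanism (illustrative only)."""

    def __init__(self, memory):
        self.memory = memory
        self.lines = {}

    def read(self, addr):
        if addr not in self.lines:              # miss: fetch from memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]                 # hit: return the cached copy

    def write(self, addr, value):
        self.lines[addr] = value                # update this cache's copy
        self.memory[addr] = value               # write through to memory


N = 10
memory = {"x": N}
cache_a = PrivateCache(memory)
cache_b = PrivateCache(memory)

cache_a.read("x")               # CPU A reads x, getting N
cache_b.read("x")               # CPU B reads x, getting N
cache_a.write("x", N - 1)       # CPU A writes N - 1

stale_b = cache_b.read("x")     # CPU B still hits on its old copy
fresh_a = cache_a.read("x")

print(stale_b, fresh_a, memory["x"])   # 10 9 9
```

Memory already holds the new value, yet CPU B keeps returning N because nothing tells its cache that the copy is out of date. That missing notification is what a coherence protocol supplies.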

### Cache coherence protocols

- Small-scale multiprocessors use hardware mechanisms to track the state of data blocks that are shared
- Two types of protocols
  - Directory based
    - The sharing status of a block of physical memory is kept in one location (the directory)
    - Interprocessor communication is used to maintain coherence
  - Snooping
    - The sharing status is distributed and kept with the block in each cache
    - The caches are usually on a shared memory bus
    - The cache controllers snoop the bus to watch for transactions that occur on data blocks that they hold

### Bus snooping protocol: write update


- An alternative to invalidating other copies is to update all cached copies of the modified data item
  - This is called write update or write broadcast
- To reduce bandwidth requirements, this protocol keeps track of whether or not a word in the cache is shared
  - If not, no broadcast is necessary
- The example assumes a write-back cache

| Processor<br>activity | Bus<br>activity | Contents of<br>CPU A's cache | Contents of<br>CPU B's cache | Contents of mem location <i>X</i> |
|-----------------------|-----------------|------------------------------|------------------------------|-----------------------------------|
| CPU A reads X         | Cache miss      | 0                            | -                            | 0                                 |
| CPU B reads X         | Cache miss      | 0                            | 0                            | 0                                 |
| CPU A writes 1        | Broadcast       | 1                            | 1                            | 1                                 |
| CPU B reads X         | Cache hit       | 1                            | 1                            | 1                                 |
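
The write-update table can be replayed with a short simulation. This is a sketch, assuming a single one-word location X, two CPUs, and memory being updated on every broadcast as the table shows; the function name `run_write_update` is hypothetical.

```python
def run_write_update():
    """Replay the four-step example under a write-update (broadcast) policy."""
    memory = {"X": 0}
    caches = {"A": {}, "B": {}}              # per-CPU cache contents
    rows = []

    def record(activity, bus):
        rows.append((activity, bus,
                     caches["A"].get("X", "-"),
                     caches["B"].get("X", "-"),
                     memory["X"]))

    def read(cpu):
        if "X" in caches[cpu]:
            bus = "Cache hit"
        else:
            # Miss: take the value from another cache if one holds it,
            # otherwise from memory.
            holders = [c for c in caches if c != cpu and "X" in caches[c]]
            source = caches[holders[0]] if holders else memory
            caches[cpu]["X"] = source["X"]
            bus = "Cache miss"
        record(f"CPU {cpu} reads X", bus)

    def write(cpu, value):
        caches[cpu]["X"] = value
        sharers = [c for c in caches if c != cpu and "X" in caches[c]]
        if sharers:
            for other in sharers:            # broadcast the new value
                caches[other]["X"] = value
            memory["X"] = value              # memory is updated too, as in the table
            bus = "Broadcast"
        else:
            bus = "None"                     # unshared word: no broadcast needed
        record(f"CPU {cpu} writes {value}", bus)

    read("A")
    read("B")
    write("A", 1)
    read("B")
    return rows


for row in run_write_update():
    print(row)
```

The final read by CPU B is a hit because the broadcast already refreshed its copy.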

# Bus snooping protocol: write invalidate

- Write invalidate is the most common protocol, both for snooping and for directory schemes
- The basic ideas behind this protocol:
  - Writes to a location invalidate other caches' copies of the block
  - Reads by other processors on invalidated data cause cache misses
  - If two processors write at the same time, one wins and obtains exclusive access
- Example assumes a write-back cache

| Processor<br>activity | Bus<br>activity | Contents of<br>CPU A's cache | Contents of<br>CPU B's cache | Contents of mem location <i>X</i> |
|-----------------------|-----------------|------------------------------|------------------------------|-----------------------------------|
| CPU A reads X         | Cache miss      | 0 | -                            | 0                                 |
| CPU B reads X         | Cache miss      | 0 | 0                            | 0                                 |
| CPU A writes 1        | Invalidate      | 1 | -                            | 0                                 |
| CPU B reads X         | Cache miss      | 1 | 1                            | 1                                 |
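
The write-invalidate table can be replayed the same way. This sketch assumes a single one-word location X, two CPUs, and write-back caches in which a dirty owner supplies the block and updates memory when another CPU misses; the function name `run_write_invalidate` is hypothetical.

```python
def run_write_invalidate():
    """Replay the four-step example under a write-invalidate policy."""
    memory = {"X": 0}
    caches = {"A": {}, "B": {}}              # per-CPU cache contents
    state = {"dirty_owner": None}            # CPU holding a modified copy, if any
    rows = []

    def record(activity, bus):
        rows.append((activity, bus,
                     caches["A"].get("X", "-"),
                     caches["B"].get("X", "-"),
                     memory["X"]))

    def read(cpu):
        if "X" in caches[cpu]:
            bus = "Cache hit"
        else:
            owner = state["dirty_owner"]
            if owner is not None:            # dirty copy is supplied and written back
                memory["X"] = caches[owner]["X"]
                state["dirty_owner"] = None
            caches[cpu]["X"] = memory["X"]
            bus = "Cache miss"
        record(f"CPU {cpu} reads X", bus)

    def write(cpu, value):
        others = [c for c in caches if c != cpu and "X" in caches[c]]
        for other in others:                 # invalidate every other copy
            del caches[other]["X"]
        caches[cpu]["X"] = value             # write-back: memory is not updated yet
        state["dirty_owner"] = cpu
        record(f"CPU {cpu} writes {value}", "Invalidate" if others else "None")

    read("A")
    read("B")
    write("A", 1)
    read("B")
    return rows


for row in run_write_invalidate():
    print(row)
```

After the write, memory still holds 0 and CPU B has no copy, so B's next read misses and picks up the value 1 from CPU A's dirty block.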


UMBC

CMSC 611 (Advanced Computer Architecture), Spring 2000 Chapter 8


## Comparing bus snooping protocols

- Write invalidate is much more popular than write update
- Write update requires more system-wide notifications
  - Multiple writes to the same word with no intervening reads require multiple broadcasts
  - With multiword cache blocks, each word written requires a broadcast
  - Delay between write by one processor and read by another is lower
- Write invalidate uses fewer system-wide notifications
  - The first word written invalidates the entire block
  - Write invalidate works on blocks, while write broadcast works on individual words or bytes
  - Reading an invalidated block causes a miss (somewhat slower)
- Since bus and memory bandwidth is the more critical resource in a bus-based multiprocessor, write invalidate performs better
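
The difference in notification counts can be illustrated with a small counter. It is a sketch under narrow assumptions: one CPU writes words of a single block that starts out shared, and no other CPU touches the block in between. The function name and the write sequence are hypothetical.

```python
def count_bus_notifications(protocol, word_writes):
    """Count system-wide notifications for one CPU writing to one shared block."""
    notifications = 0
    block_shared = True
    for _word in word_writes:
        if protocol == "update":
            notifications += 1          # every written word is broadcast
        elif protocol == "invalidate":
            if block_shared:
                notifications += 1      # first write invalidates the whole block
                block_shared = False    # later writes go to a private block
        else:
            raise ValueError(f"unknown protocol: {protocol}")
    return notifications


writes = [0, 1, 2, 3, 0]   # four words of one block, then word 0 again

print(count_bus_notifications("update", writes))       # 5
print(count_bus_notifications("invalidate", writes))   # 1
```

Under these assumptions write update pays once per word written, while write invalidate pays once per block.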


### Implementing the write-invalidate protocol

- Write invalidate is simple in bus-based schemes
  - Acquire the bus and broadcast the address to be invalidated
  - Since all processors snoop the bus, they can check the address against items in their cache
- Bus acquisition serializes writes to a memory location
  - Writes to a shared data item cannot complete until the bus is acquired
- How is a data item located when a cache miss occurs?
  - For write-through, it's in memory
  - For write-back, snooping can be used: if a processor finds that it has a dirty copy of the requested cache block, it provides the block instead of memory
- Write-back caches are greatly preferred in a multiprocessor environment since they reduce memory bandwidth

### Writes in write-invalidate protocols

- Writes are an issue with cache coherence protocols in general
- The CPU needs to know if any other caches contain the block to be written by a processor
  - If there are none, then the write need not be placed on the bus, reducing both the time to complete the write and the memory bandwidth used
- This can be tracked by adding an *extra state bit* (in addition to the valid and dirty bits) that indicates if the block is shared
  - If the bit is set (the block is shared), the cache generates an invalidation on the bus and marks the block as private
  - If another processor later requests the block, the miss is snooped and the "owner" sets the state bit to shared

### Sample bus snooping protocol

- Implemented by incorporating a finite state controller in each node
  - The controller responds to requests from the processor and bus
- To simplify the controller, write hits and write misses to shared blocks are treated as write misses
  - This causes processors with copies to invalidate them

| Request    | Source    | Function                                                               |
|------------|-----------|------------------------------------------------------------------------|
| Read hit   | Processor | Read data in cache                                                     |
| Write hit  | Processor | Write data in cache                                                    |
| Read miss  | Bus       | Request data from cache or memory                                      |
| Write miss | Bus       | Request data from cache or memory, and perform any needed invalidates  |

### Optimizations for tag checking

- Note that every bus transaction checks cache-address tags; this could potentially interfere with CPU cache access
- Reduce interference by
  - Duplicating the tags: bus access proceeds in parallel with CPU
    - On misses, the processor arbitrates for and updates both sets of tags
    - Snoop also does this to perform an invalidate or to update the shared bit
    - However, a snoop may require fetching a block, thus stalling
  - Employing a multilevel cache with inclusion
    - Snooping is directed to L2, where there are fewer processor accesses
    - If a snoop gets a hit, then it must arbitrate for L1 to update state and possibly retrieve data, usually stalling the processor
    - Since it is popular to use multi-level caches in multiprocessors (to reduce memory bandwidth), this solution is usually adopted
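
The extra shared bit used for writes in a write-invalidate protocol can be sketched as a pair of toy classes. This is a simplified model, not the controller from the slides: the `Bus` and `SnoopingCache` names are hypothetical, data values are omitted so that only state bits are tracked, and a block fetched on a read miss is conservatively marked shared.

```python
class Bus:
    """A shared bus that every cache controller snoops."""

    def __init__(self):
        self.caches = []
        self.log = []                        # bus transactions, in order

    def broadcast(self, kind, sender, addr):
        self.log.append((kind, addr))
        for cache in self.caches:
            if cache is not sender:
                cache.snoop(kind, addr)


class SnoopingCache:
    def __init__(self, bus):
        self.bus = bus
        self.lines = {}                      # addr -> {"dirty": bool, "shared": bool}
        bus.caches.append(self)

    def read(self, addr):
        if addr not in self.lines:           # read miss goes on the bus
            self.bus.broadcast("read miss", self, addr)
            # Assumption: no clean-and-private state, so mark the block shared.
            self.lines[addr] = {"dirty": False, "shared": True}

    def write(self, addr):
        line = self.lines.get(addr)
        if line is None:                     # write miss: others must invalidate
            self.bus.broadcast("write miss", self, addr)
            self.lines[addr] = {"dirty": True, "shared": False}
        else:
            if line["shared"]:               # shared bit set: invalidate other copies
                self.bus.broadcast("invalidate", self, addr)
                line["shared"] = False       # block is now private
            line["dirty"] = True             # private block: no bus traffic needed

    def snoop(self, kind, addr):
        line = self.lines.get(addr)
        if line is None:
            return
        if kind == "read miss":
            line["shared"] = True            # the owner marks the block shared again
            line["dirty"] = False            # assume the block is supplied and written back
        else:                                # "invalidate" or "write miss"
            del self.lines[addr]


bus = Bus()
cpu_a, cpu_b = SnoopingCache(bus), SnoopingCache(bus)
cpu_a.read(0x40)
cpu_b.read(0x40)
cpu_a.write(0x40)      # shared block: one invalidate on the bus
cpu_a.write(0x40)      # private block: no bus transaction
print(bus.log)
```

The second write by `cpu_a` adds nothing to `bus.log`, which is the saving the shared bit is meant to provide.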




## Snooping protocols: wrapping up

- Protocol assumes that operations are atomic
  - In reality, a write miss is not atomic: there is just too much work to do
  - Also, read misses on a split transaction bus are not atomic
  - Nonatomic actions introduce the possibility of deadlock...
- Real protocols distinguish between write hits and write misses
  - From the shared state, a write miss would require the action shown previously
  - However, a write hit does not require that the data be fetched since it is up-to-date; all that's needed is an invalidate operation
- Real protocols distinguish between shared blocks and blocks that are clean in exactly one cache
  - A "clean and private" state eliminates the need to generate a bus transaction on a write to a "clean and private" block
