Timing of Instructions
To understand the timing of instructions, you first have to understand
a couple of concepts. CPUs are rated at a certain MHz (or GHz);
that is, their internal clock ticks so many times per second. The faster
the clock, the less time things will take. Also, the faster the clock,
the more ticks or "machine cycles" there are in a second. Since there
are many different models of the x86 family, plus many different clock
speeds within the same generation, it is easier to measure in machine
cycles than in fractions of a second. Then, once you know the clock
speed, you can convert machine cycles to seconds for any model.
Well, sort of, since some instructions use a different number of
cycles depending on the model:
Cycles for the MOV reg,mem Instruction
CPU   | Clock Cycles
------+------------------
88/86 | 8+EA (W88=12+EA)
286   | 5
386   | 4
486   | 1
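The conversion from cycles to seconds is simple arithmetic once you pick a clock speed. A small sketch using the MOV reg,mem counts from the table above (the clock speeds chosen below are just illustrative, and the 88/86 entry ignores the effective-address overhead):

```python
# Convert a machine-cycle count to seconds at a given clock speed.
def cycles_to_seconds(cycles, clock_hz):
    return cycles / clock_hz

# Cycle counts for MOV reg,mem, from the table above (EA overhead ignored).
mov_reg_mem = {"8086": 8, "286": 5, "386": 4, "486": 1}

# The same instruction on a 4 MHz 8086 vs. a 25 MHz 486:
t_8086 = cycles_to_seconds(mov_reg_mem["8086"], 4_000_000)   # 2 microseconds
t_486  = cycles_to_seconds(mov_reg_mem["486"], 25_000_000)   # 40 nanoseconds
print(t_8086, t_486)
```

Fewer cycles and a faster clock multiply together, which is why the same instruction spans a factor of 50 in wall-clock time here.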
It is difficult to figure out how long an instruction will take!
Newer chips have an architecture that is designed to perform certain
"tricks" to speed up execution:
- Instructions must reside in memory before they can
be executed. On modern computers with modern operating systems, not
all of the program will be in memory at one time. If the needed
instruction or data item is not in memory, the operating system will
decide where to put that block of the program. If no memory is available,
then a block currently in memory will have to be released. Sometimes
that block must first be written to the disk. Then the new block is
brought into memory.
- When the CPU needs to "look at" a piece of a program, the block of
code or data may need to be loaded from "slow" memory into "fast" or cache
memory. Of course, moving a block into cache will take time!
- When the CPU fetches an instruction for execution, it actually fetches a
whole 32-byte block at a time. This is called pre-fetch, and seems to
use 4-5 machine cycles that are then unavailable for instruction execution. However,
all the instructions in the pre-fetched block can then be executed
at maximum CPU speed.
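A back-of-the-envelope sketch shows why pre-fetch pays off despite its up-front cost. The 32-byte block and 4-5 cycle figures come from the text above; the 3-byte average instruction length is an assumption for illustration:

```python
block_bytes = 32         # one pre-fetched block (from the text)
prefetch_cycles = 5      # cycles to fetch the block (upper figure from the text)
avg_instr_bytes = 3      # assumed average x86 instruction length

# Roughly how many instructions one block holds, and what the
# pre-fetch overhead works out to per instruction, amortized.
instrs_per_block = block_bytes // avg_instr_bytes
overhead_per_instr = prefetch_cycles / instrs_per_block

print(instrs_per_block)    # 10
print(overhead_per_instr)  # 0.5 extra cycles per instruction
```

Half a cycle of amortized overhead per instruction is far cheaper than fetching each instruction from slow memory on its own.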
The early CPUs went through three steps, each taking a machine cycle, for every instruction:
- fetch
- decode
- execute
This gave us something like this:
fetch | decode | execute | fetch | decode | execute |
An improvement to this was to overlap instructions like this:
fetch | decode | execute | fetch | decode | execute |
        fetch | decode | execute | fetch | decode | execute |
                fetch | decode | execute | fetch | decode | execute |
As you can see, in the early scheme, 2/3 of each cycle was doing no "effective" work. With
this change, you get three times as much work per machine cycle as before at
the same exact clock speed.
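The speedup from overlapping can be put in numbers. A minimal sketch, assuming an idealized three-stage pipeline with no stalls and no branches:

```python
def sequential_cycles(n_instructions, stages=3):
    # Early scheme: fetch, decode, execute done one after
    # another, so every instruction costs all three cycles.
    return n_instructions * stages

def pipelined_cycles(n_instructions, stages=3):
    # Overlapped scheme: the first instruction takes `stages`
    # cycles to get through; after that, one completes per cycle.
    return stages + (n_instructions - 1)

n = 1000
print(sequential_cycles(n))  # 3000
print(pipelined_cycles(n))   # 1002
```

For long instruction streams the ratio approaches 3, which is the "three times as much work" claimed above.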
- The CPU executes instructions in an assembly line called a pipeline.
Instruction processing is in several stages, and the execute times listed in manuals
are not the total time in the pipeline. They are the amount of execution time added to a full
pipeline under ideal circumstances.
For instance, if an instruction in the
pipeline stores a result that is needed by the next instruction, the later instruction is
delayed. Of course, the entire pipeline after a JMP instruction must be flushed and
the pipeline reloaded. This overhead is not included either!
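The cost of those flushes can also be put in cycle terms. A sketch assuming a three-stage pipeline where every taken jump discards the work in flight and refills from scratch:

```python
def cycles_with_branches(n_instructions, n_taken_jumps, stages=3):
    # Ideal pipeline: `stages` cycles to fill, then one
    # instruction completes per cycle.
    base = stages + (n_instructions - 1)
    # Each taken jump flushes the pipeline; the refill costs
    # the fill time over again (minus the cycle already counted).
    return base + n_taken_jumps * (stages - 1)

print(cycles_with_branches(1000, 0))   # 1002: no jumps taken
print(cycles_with_branches(1000, 50))  # 1102: 50 flushes add 100 cycles
```

Deeper pipelines make `stages` larger, so the same number of jumps costs proportionally more, which is one reason branches are expensive on modern CPUs.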
- The Pentium CPU allows parallel processing by using multiple pipelines. The original
Pentium had two pipelines, called U and V. U was capable of processing all instructions, and V
was capable of processing selected (relatively simple) instructions. On occasions
where two consecutive instructions could be executed in the two pipelines simultaneously, the second
instruction in effect takes no time at all! Later versions of the Pentium seem to have more than
two pipelines. Of course, extremely careful interlocking by the hardware is necessary to make sure that the
effect of this parallelism gives the same result as if the instructions were executed strictly in sequence.
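A rough model of U/V pairing: walk the instruction stream, and whenever the next instruction is simple enough for the V pipe, issue both in the same cycle. This is a simplification (real pairing rules also depend on register dependencies between the pair), and the instruction classification below is hypothetical:

```python
# Hypothetical set of "relatively simple" instructions the V pipe accepts.
V_CAPABLE = {"mov", "add", "sub", "inc", "dec", "cmp"}

def issue_cycles(instructions):
    """Count issue cycles for a two-pipe (U/V) front end.

    The U pipe takes any instruction; if the following instruction
    is V-capable, it issues in the same cycle for free. Register
    dependencies between the pair are ignored in this sketch.
    """
    cycles = 0
    i = 0
    while i < len(instructions):
        cycles += 1                # U pipe takes instructions[i]
        if i + 1 < len(instructions) and instructions[i + 1] in V_CAPABLE:
            i += 2                 # V pipe takes the next one "for free"
        else:
            i += 1
    return cycles

print(issue_cycles(["mov", "add", "mov", "add"]))  # 2 cycles instead of 4
print(issue_cycles(["div", "div", "div"]))         # 3: nothing pairs
```

Ordering instructions so that pairable ones alternate is exactly the kind of scheduling a good compiler (or careful assembly programmer) does for the Pentium.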
Knowing how much time an instruction takes can help you
optimize your program.
If you are looking at a very small amount of code, it is known as
"local optimization" or "peephole optimization". Global
optimization is when you look at everything from whole loops up to the basic
algorithm used to solve the problem.
Peephole optimization does not result in an incredible amount of
improvement, unless you are improving a loop which gets executed many,
many times. On a 100 MHz Pentium, you might squeeze out 1/10,000,000
of a second with each improvement.
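The arithmetic behind that estimate is worth writing out. At 100 MHz one clock cycle is 10 nanoseconds, so a peephole fix only adds up when the loop runs many times; the cycle savings and iteration count below are illustrative assumptions:

```python
clock_hz = 100_000_000                 # 100 MHz Pentium
cycle_ns = 1_000_000_000 // clock_hz   # 10 ns per clock cycle

cycles_saved = 2        # suppose a peephole fix saves 2 cycles
iterations = 1_000_000  # inside a loop run a million times

saved_ns = cycles_saved * cycle_ns * iterations
print(saved_ns)         # 20000000 ns, i.e. 0.02 seconds
```

Outside a hot loop the same fix saves 20 nanoseconds total, which is why the advice below says not to worry about it much except for loops.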
Studies show that four percent of the program accounts for more than
half of the execution time, and its location is not necessarily
evident.
Global optimization is much more useful. A good design is critical to
getting the best performance!
It is just as easy to write efficient code as it is to write inefficient code!
Some ad hoc rules for local optimization:
- Except for mul and div, the execution time is
roughly proportional to the number of bytes in the
instruction.
- When using registers, it may be faster to use ax or
al.
- Don't worry about it much, except for loops.
UMBC CSEE