Timing of Instructions
To understand the timing of instructions, you first have to understand
a couple of concepts. CPUs are rated at a certain MHz (or GHz);
that is, their internal clock ticks so many times per second. The faster
the clock, the less time things will take. Also, the faster the clock,
the more ticks or "machine cycles" there are in a second. Since there
are many different models of the x86 family, plus many different clock
speeds within the same generation, it is easier to measure in machine
cycles than in fractions of a second. Then, once you know the clock
speed, you can convert machine cycles to seconds for any model.
Well, sort of, since some instructions use a different number of
cycles depending on the model:
Cycles for the MOV reg,mem Instruction
CPU   | Clock Cycles
------+------------------
88/86 | 8+EA (W88=12+EA)
286   | 5
386   | 4
486   | 1
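The conversion from cycles to seconds is simple arithmetic once you pick a clock speed. A small sketch using the MOV reg,mem counts from the table above (the clock speeds chosen below are just illustrative, and the 88/86 entry ignores the effective-address overhead):

```python
# Convert a machine-cycle count to seconds at a given clock speed.
def cycles_to_seconds(cycles, clock_hz):
    return cycles / clock_hz

# Cycle counts for MOV reg,mem, from the table above (EA overhead ignored).
mov_reg_mem = {"8086": 8, "286": 5, "386": 4, "486": 1}

# The same instruction on a 4 MHz 8086 vs. a 25 MHz 486:
t_8086 = cycles_to_seconds(mov_reg_mem["8086"], 4_000_000)   # 2 microseconds
t_486  = cycles_to_seconds(mov_reg_mem["486"], 25_000_000)   # 40 nanoseconds
print(t_8086, t_486)
```

Fewer cycles and a faster clock multiply together, which is why the same instruction spans a factor of 50 in wall-clock time here.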
It is difficult to figure out how long an instruction will take!
Newer chips have an architecture that is designed to perform certain
"tricks" to speed up execution:
- Instructions must reside in memory before they can
be executed. On modern computers with modern operating systems, not
all of the program will be in memory at one time. If the needed
instruction or data item is not in memory, the operating system will
decide where to put that block of the program. If no memory is available,
then a block currently in memory will have to be released. Sometimes
that block must first be written to the disk. Then the new block is
brought into memory.
- When the CPU needs to "look at" a piece of a program, the block of
code or data may need to be loaded from "slow" memory into "fast" or cache
memory. Of course, moving a block into cache will take time!
- When the CPU fetches an instruction for execution, it actually fetches a
whole 32-byte block at a time. This is called pre-fetch, and seems to
use 4-5 machine cycles that are then unavailable for instruction execution. However,
all the instructions in the pre-fetched block can then be executed
at maximum CPU speed.
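A back-of-the-envelope sketch shows why pre-fetch pays off despite its up-front cost. The 32-byte block and 4-5 cycle figures come from the text above; the 3-byte average instruction length is an assumption for illustration:

```python
block_bytes = 32         # one pre-fetched block (from the text)
prefetch_cycles = 5      # cycles to fetch the block (upper figure from the text)
avg_instr_bytes = 3      # assumed average x86 instruction length

# Roughly how many instructions one block holds, and what the
# pre-fetch overhead works out to per instruction, amortized.
instrs_per_block = block_bytes // avg_instr_bytes
overhead_per_instr = prefetch_cycles / instrs_per_block

print(instrs_per_block)    # 10
print(overhead_per_instr)  # 0.5 extra cycles per instruction
```

Half a cycle of amortized overhead per instruction is far cheaper than fetching each instruction from slow memory on its own.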
The early CPUs went through three steps, each taking a machine cycle, for every instruction:
- fetch
- decode
- execute
This gave us something like this:
fetch | decode | execute | fetch | decode | execute |
An improvement to this was to overlap instructions like this:
fetch | decode | execute | fetch | decode | execute |
        fetch | decode | execute | fetch | decode | execute |
                fetch | decode | execute | fetch | decode | execute |
As you can see, in the early scheme, 2/3 of each cycle was doing no "effective" work. With
this change, you get three times as much work per machine cycle as before at
the same exact clock speed.
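The speedup from overlapping can be put in numbers. A minimal sketch, assuming an idealized three-stage pipeline with no stalls and no branches:

```python
def sequential_cycles(n_instructions, stages=3):
    # Early scheme: fetch, decode, execute done one after
    # another, so every instruction costs all three cycles.
    return n_instructions * stages

def pipelined_cycles(n_instructions, stages=3):
    # Overlapped scheme: the first instruction takes `stages`
    # cycles to get through; after that, one completes per cycle.
    return stages + (n_instructions - 1)

n = 1000
print(sequential_cycles(n))  # 3000
print(pipelined_cycles(n))   # 1002
```

For long instruction streams the ratio approaches 3, which is the "three times as much work" claimed above.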
- The CPU executes instructions in an assembly line called a pipeline.
Instruction processing is in several stages, and the execute times listed in manuals
are not the total time in the pipeline. They are the amount of execution time added to a full
pipeline under ideal circumstances.
For instance, if an instruction in the
pipeline stores a result that is needed by the next instruction, the later instruction is
delayed. Of course, the entire pipeline after a JMP instruction must be flushed and
the pipeline reloaded. This overhead is not included either!
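The cost of those flushes can also be put in cycle terms. A sketch assuming a three-stage pipeline where every taken jump discards the work in flight and refills from scratch:

```python
def cycles_with_branches(n_instructions, n_taken_jumps, stages=3):
    # Ideal pipeline: `stages` cycles to fill, then one
    # instruction completes per cycle.
    base = stages + (n_instructions - 1)
    # Each taken jump flushes the pipeline; the refill costs
    # the fill time over again (minus the cycle already counted).
    return base + n_taken_jumps * (stages - 1)

print(cycles_with_branches(1000, 0))   # 1002: no jumps taken
print(cycles_with_branches(1000, 50))  # 1102: 50 flushes add 100 cycles
```

Deeper pipelines make `stages` larger, so the same number of jumps costs proportionally more, which is one reason branches are expensive on modern CPUs.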
- The Pentium CPU allows parallel processing by using multiple pipelines. The original
Pentium had two pipelines, called U and V. U was capable of processing all instructions, and V
was capable of processing selected (relatively simple) instructions. On occasions
where two consecutive instructions could be executed in the two pipelines simultaneously, the second
instruction in effect takes no time at all! Later versions of the Pentium seem to have more than
two pipelines. Of course, extremely careful interlocking by the hardware is necessary to make sure that the
effect of this parallelism gives the same result as if the instructions were executed strictly in sequence.
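A rough model of U/V pairing: walk the instruction stream, and whenever the next instruction is simple enough for the V pipe, issue both in the same cycle. This is a simplification (real pairing rules also depend on register dependencies between the pair), and the instruction classification below is hypothetical:

```python
# Hypothetical set of "relatively simple" instructions the V pipe accepts.
V_CAPABLE = {"mov", "add", "sub", "inc", "dec", "cmp"}

def issue_cycles(instructions):
    """Count issue cycles for a two-pipe (U/V) front end.

    The U pipe takes any instruction; if the following instruction
    is V-capable, it issues in the same cycle for free. Register
    dependencies between the pair are ignored in this sketch.
    """
    cycles = 0
    i = 0
    while i < len(instructions):
        cycles += 1                # U pipe takes instructions[i]
        if i + 1 < len(instructions) and instructions[i + 1] in V_CAPABLE:
            i += 2                 # V pipe takes the next one "for free"
        else:
            i += 1
    return cycles

print(issue_cycles(["mov", "add", "mov", "add"]))  # 2 cycles instead of 4
print(issue_cycles(["div", "div", "div"]))         # 3: nothing pairs
```

Ordering instructions so that pairable ones alternate is exactly the kind of scheduling a good compiler (or careful assembly programmer) does for the Pentium.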
Knowing how much time an instruction takes can help you
optimize your program.
If you are looking at a very small amount of code, it is known as
"local optimization" or "peephole optimization". Global
optimization is when you look at everything from whole loops up to the basic
algorithm used to solve the problem.
Peephole optimization does not result in an incredible amount of
improvement, unless you are improving a loop which gets executed many,
many times. On a 100 MHz Pentium, you might squeeze out 1/10,000,000
of a second with each improvement.
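The arithmetic behind that estimate is worth writing out. At 100 MHz one clock cycle is 10 nanoseconds, so a peephole fix only adds up when the loop runs many times; the cycle savings and iteration count below are illustrative assumptions:

```python
clock_hz = 100_000_000                 # 100 MHz Pentium
cycle_ns = 1_000_000_000 // clock_hz   # 10 ns per clock cycle

cycles_saved = 2        # suppose a peephole fix saves 2 cycles
iterations = 1_000_000  # inside a loop run a million times

saved_ns = cycles_saved * cycle_ns * iterations
print(saved_ns)         # 20000000 ns, i.e. 0.02 seconds
```

Outside a hot loop the same fix saves 20 nanoseconds total, which is why the advice below says not to worry about it much except for loops.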
Studies show that four percent of the program accounts for more than
half of the execution time, and its location is not necessarily
evident.
Global optimization is much more useful. A good design is critical to
getting the best performance!
It is just as easy to write efficient code as it is to write inefficient code!
Some ad hoc rules for local optimization:
- Except for mul and div, the execution time is
roughly proportional to the number of bytes in the
instruction.
- When using registers, it may be faster to use ax or
al.
- Don't worry about it much, except for loops.
UMBC CSEE