seconds = instructions × clocks × seconds




program program instruction clock

The Performance Equation

The performance equation analyzes execution time as a product of three factors that are relatively independent of each other.

This equation remains valid if the time units are changed on both sides of the equation. The left-hand side and the factors on the right-hand side are discussed in the following sections.

The three factors are, in order, known as the instruction count (IC), clocks per instruction (CPI), and clock time (CT). CPI is computed as an effective value.

The Performance Equation

The performance equation analyzes execution time as a product of three factors that are relatively independent of each other.

Instruction Count

Computer architects can reduce the instruction count by adding more powerful instructions to the instruction set. However, this can increase either CPI or clock time, or both.

Clocks Per Instruction

Computer architects can reduce CPI by exploiting more instruction-level parallelism. If they add more complex instructions it often increases CPI.

Clock Time

Clock time depends on transistor speed and the complexity of the work done in a single clock. Clock time can be reduced when transistor sizes decrease. However, power consumption increases when clock time is reduced. This increase the amount of heat generated.

Instruction Count

Instruction (IC) count is a dynamic measure: the total number of instruction executions involved in a program. It is dominated by repetitive operations such as loops and recursions.

Instruction count is affected by the power of the instruction set. Different instruction sets may do different amounts of work in a single instruction. CISC processor instructions can often accomplish as much as two or three RISC processor instructions. Some CISC processor instructions have built-in looping so that they can accomplish as much as several hundred RISC instruction executions.

For predicting the effects of incremental changes, architects use execution traces of benchmark programs to get instruction counts. If the incremental change does not change the instruction set then the instruction count normally does not change. If there are small changes in the instruction set then trace information can be used to estimate the change in the instruction count.

For comparison purposes, two machines with different instruction sets can be compared based on compilations of the same high-level language code on the two machines.

Clocks Per Instruction

Clocks per instruction (CPI) is an effective average. It is averaged over all of the instruction executions in a program.

CPI is affected by instruction-level parallelism and by instruction complexity. Without instruction-level parallelism, simple instructions usually take 4 or more cycles to execute. Instructions that execute loops take at least one clock per loop iteration. Pipelining (overlapping execution of instructions) can bring the average for simple instructions down to near 1 clock per instruction. Superscalar pipelining (issuing multiple instructions per cycle) can bring the average down to a fraction of a clock per instruction.

For computing clocks per instruction as an effective average, the cases are categories of instructions, such as branches, loads, and stores. Frequencies for the categories can be extracted from execution traces. Knowledge of how the architecture handles each category yields the clocks per instruction for that category.

Clock Time

Clock time (CT) is the period of the clock that synchronizes the circuits in a processor. It is the reciprocal of the clock frequency.

For example, a 1 GHz processor has a cycle time of 1.0 ns and a 4 GHz processor has a cycle time of 0.25 ns.

Clock time is affected by circuit technology and the complexity of the work done in a single clock. Logic gates do not operate instantly. A gate has a propagation delay that depends on the number of inputs to the gate (fan in) and the number of other inputs connected to the gate's output (fan out). Increasing either the fan in or the fan out slows down the propagation time. Cycle time is set to be the worst-case total propagation time through gates that produce a signal required in the next cycle. The worst-case total propagation time occurs along one or more signal paths through the circuitry. These paths are called critical paths.

For the past 35 years, integrated circuit technology has been greatly affected by a scaling equation that tells how individual transistor dimensions should be altered as the overall dimensions are decreased. The scaling equations predict an increase in speed and a decrease in power consumption per transistor with decreasing size. Technology has improved so that about every 3 years, linear dimensions have decreased by a factor of 2. Transistor power consumption has decreased by a similar factor. Speed increased by a similar factor until about 2005. At that time, power consumption reached the point where air cooling was not sufficient to keep processors cool if the ran at the highest possible clock speed.

Problem Statement

Suppose a program (or a program task) takes 1 billion instructions to execute on a processor running at 2 GHz. Suppose also that 50% of the instructions execute in 3 clock cycles, 30% execute in 4 clock cycles, and 20% execute in 5 clock cycles. What is the execution time for the program or task?

Solution

Suppose a program (or a program task) takes 1 billion instructions to execute on a processor running at 2 GHz. Suppose also that 50% of the instructions execute in 3 clock cycles, 30% execute in 4 clock cycles, and 20% execute in 5 clock cycles. What is the execution time for the program or task?

We have the instruction count: 109 instructions. The clock time can be computed quickly from the clock rate to be 0.5×10-9 seconds. So we only need to to compute clocks per instruction as an effective value:

Value Frequency Product
3 0.5 1.5
4 0.3 1.2
5 0.2 1.0
CPI = 3.7

Then we have

Execution time = 1.0×109 × 3.7 × 0.5×10-9 sec = 1.85 sec.

Problem Statement

Suppose the processor in the previous example is redesigned so that all instructions that initially executed in 5 cycles now execute in 4 cycles. Due to changes in the circuitry, the clock rate has to be decreased from 2.0 GHz to 1.9 GHz. What is the overall percentage improvement?

Solution Form

Suppose the processor in the previous example is redesigned so that all instructions that initially executed in 5 cycles now execute in 4 cycles. Due to changes in the circuitry, the clock rate has to be decreased from 2.0 GHz to 1.9 GHz. No changes are made to the instruction set. What is the overall percentage improvement?

We can determine the percentage improvement quickly by first finding the ratio between before and after performance. The performance equation implies that this ratio will be a product of three factors: a performance ratio for instruction count, a performance ratio for CPI or its reciprocal, instruction throughput, and a performance ratio for clock time or its reciprocal, clock frequency. We can ignore the first factor in this problem since it is 1.0: the instruction count has not changed. We are left with determining the performance ratio for CPI and, since we are given clock frequencies, the performance ratio for clock frequencies.

Solution

Suppose the processor in the previous example is redesigned so that all instructions that initially executed in 5 cycles now execute in 4 cycles. Due to changes in the circuitry, the clock rate has to be decreased from 2.0 GHz to 1.9 GHz. No changes are made to the instruction set. What is the overall percentage improvement?

The performance ratio for frequencies must be less than 1.0: if other factors are the same then a slower clock rate implies worse performance. So this factor of the improvement ratio must be 1.9/2.0.

For the clocks per instruction, we had a value of 3.7 before the change. We compute clocks per instruction after the change as an effective value:

Value Frequency Product
3 0.5 1.5
4 0.3 1.2
4 0.2 0.8
CPI = 3.5

Now, lower clocks per instruction means higher instruction throughput and thus better performance, so we expect this part of the performance ratio to be greater than 1.0; that is, 3.7/3.5.

Then we have

performance ratio
 
=
 
3.7
 
×
 
1.9
 
=
 
7.03
 
 
=
 
1.0043
 



3.5 2.0 7.0

This is a 0.43% improvement, which is probably not worth the effort.