Register Renaming Performance

The "Instruction" column in the table below shows part of the code from the unrolled dot product loop in the Loop Unrolling web page. This loop code will be used in our first examples.

The "$f8", "$f10", and "$f12" columns show times that the values in the indicated registers are ready.

The "Iss" column shows when an instruction is issued in a renaming architecture.

The "Exec" column shows when an instruction begins its execution phase.

The "Rdy" column shows when the instruction completes its execution phase and its destination operand value is ready. For this example, it is assumed that the execution phase requires 2 cycles for l.d, 4 cycles for add.d, and 20 cycles for mul.d.

The "Comm" column shows when an instruction is committed.

Instruction $f8 $f10 $f12 Iss Exec Rdy Comm
0 0 0 - - - -
l.d $f10, 0($5) " 2 " 0 0 2 3
l.d $f12, 0($6) " " 3 1 1 3 4
mul.d $f10, $f10, $f12 " 23 " 2 3 23 24
add.d $f8, $f8, $f10 27 " " 3 23 27 28

Other parts of this web page describe how the columns are calculated and provide some interesting examples. Later these examples are analyzed and the analysis is qualitatively extended to enhancements of the renaming architecture.

Calculations

Entries in the table are calculated row-by-row. In each row the issue and execution times are computed first. Then the ready and execution times are computed. Finally the commit times are computed.

Calculating Issue and Execution Times

The "Iss" column shows when an instruction is issued in a renaming architecture. An instruction is issued every cycle as long as issue circuitry can get a value name (rename register number) from the values circuitry. Values circuitry can provide a value name if the number of in-flight instructions plus the number of architectural registers does not exceed the number of rename registers. We will assume that does not happen. The validity of that assumption is discussed in the "Analysis" section.

The "Exec" column shows when an instruction begins its execution phase. An instruction begins its execution phase after it is issued, when both of its source operand values are ready and an appropriate functional unit is available to perform the operation. We will make an idealistic assumption that there is always an available functional unit. Then the "Exec" time is the larger of the following:

The register times can be found by searching up in the columns for the source registers, which are $f10 and $f12 in the example below. This diagram shows a computed "Exec" time in blue and the values that it depends on shown in green.

Instruction $f8 $f10 $f12 Iss Exec Rdy Comm
0 0 0 - - - -
l.d $f10, 0($5) " 2 " 0 0 2 3
l.d $f12, 0($6) " " 3 1 1 3 4
mul.d $f10, $f10, $f12 ? ? ? 2 3 ? ?

Calculating Rdy and Register Times

The "Rdy" column shows when the instruction completes its execution phase and its destination operand value is ready. For this example, it is assumed that the execution phase requires the following number of cycles.

l.d 2
add.d 4
mul.d 20

The "Rdy" time for an instruction is just the sum of its "Exec" time and the appropriate value from the above table.

The "$f8", "$f10", "$f12", and "$f10" columns show times that the values in the indicated registers is ready. These values are left unchanged except for the destination operand of the instruction. Its value will be ready when the instruction completes it execution phase - the same as the time in the "Rdy" column.

The diagram below shows a computed "Rdy" time and a computed register time ($12) in blue. The values that they depend on are shown in green.

Instruction $f8 $f10 $f12 Iss Exec Rdy Comm
0 0 0 - - - -
l.d $f10, 0($5) " 2 " 0 0 2 3
l.d $f12, 0($6) " " 3 1 1 3 4
mul.d $f10, $f10, $f12 " 23 " 2 3 23 ?

Calculating Commit Times I

The "Comm" column shows when an instruction is committed. An instruction is committed when it is at the head of the "in flight" list and its destination operand value is "ready". To find the commit time for an instruction, take the larger of the two following values:

Adding 1 to the "Rdy" time reflects the fact that commit begins at least one cycle after execution completes. Adding 1 to the "Comm" time assumes that the commit circuitry can only commit one instruction per cycle. The validity of this assumption is discussed in the "Conclusions" section.

The diagram below show a computed "Comm" time in blue. The values that it depends on are shown in green.

Instruction $f8 $f10 $f12 Iss Exec Rdy Comm
0 0 0 - - - -
l.d $f10, 0($5) " 2 " 0 0 2 3
l.d $f12, 0($6) " " 3 1 1 3 4
mul.d $f10, $f10, $f12 " 23 " 2 3 23 24

Calculating Commit Times II

Here is a later example showing an instruction that is ready quite early, but must wait for the previous instruction to commit.

Instruction $f8 $f10 $f12 Iss Exec Rdy Comm
0 0 0 - - - -
l.d $f10, 0($5) " 2 " 0 0 2 3
l.d $f12, 0($6) " " 3 1 1 3 4
mul.d $f10, $f10, $f12 " 23 " 2 3 23 24
add.d $f8, $f8, $f10 27 " " 3 23 27 28
l.d $f10, 8($5) " 6 " 4 4 6 29

More of the Table

The table below shows four full iterations from the unrolled dot product loop.

Instruction $f8 $f10 $f12 Iss Exec Rdy Comm
0 0 0 - - - -
l.d $f10, 0($5) " 2 " 0 0 2 3
l.d $f12, 0($6) " " 3 1 1 3 4
mul.d $f10, $f10, $f12 " 23 " 2 3 23 24
add.d $f8, $f8, $f10 27 " " 3 23 27 28
l.d $f10, 8($5) " 6 " 4 4 6 29
l.d $f12, 8($6) " " 7 5 5 7 30
mul.d $f10, $f10, $f12 " 27 " 6 7 27 31
add.d $f8, $f8, $f10 31 " " 7 27 31 32
l.d $f10, 16($5) " 10 " 8 8 10 33
l.d $f12, 16($6) " " 11 9 9 11 34
mul.d $f10, $f10, $f12 " 31 " 10 11 31 35
add.d $f8, $f8, $f10 35 " " 11 31 35 36
l.d $f10, 24($5) " 14 " 12 12 14 37
l.d $f12, 24($6) " " 15 13 13 15 38
mul.d $f10, $f10, $f12 " 35 " 14 15 35 39
add.d $f8, $f8, $f10 39 " " 15 35 39 40

With 100 loop iterations, fully unrolled, the last add will commit during cycle 424 (28 + 99*4).

More Examples

A register renaming simulator is available here. This zip file can be unzipped to produce a directory named renaming-simulation. This directory contains a jar file named Renaming.jar. Click on this file to run the simulation.

The simulation demonstrates how register renaming performs on the three completely unrolled loops in the Loop Unrolling web page. It has controls for varying instruction latencies and the maximum numbers of instruction issues and commits per cycle.

Here we have two more examples which use the same dot product code as in the original example but modify the operation latencies. The first has a slow multiply. The second has a slow add. The latencies are summarized in the following table.

Example l.d mul.d add.d
First Example 2 20 4
Slow Multiply 2 30 4
Slow Add 2 20 5

Slow Multiply

Here the mul.d execution time is increased from 20 cycles to 30 cycles.

Instruction $f8 $f10 $f12 Iss Exec Rdy Comm
0 0 0 - - - -
l.d $f10, 0($5) " 2 " 0 0 2 3
l.d $f12, 0($6) " " 3 1 1 3 4
mul.d $f10, $f10, $f12 " 33 " 2 3 33 34
add.d $f8, $f8, $f10 37 " " 3 33 37 38
l.d $f10, 8($5) " 6 " 4 4 6 39
l.d $f12, 8($6) " " 7 5 5 7 40
mul.d $f10, $f10, $f12 " 37 " 6 7 37 41
add.d $f8, $f8, $f10 41 " " 7 37 41 42
l.d $f10, 16($5) " 10 " 8 8 10 43
l.d $f12, 16($6) " " 11 9 9 11 44
mul.d $f10, $f10, $f12 " 41 " 10 11 41 45
add.d $f8, $f8, $f10 45 " " 11 41 45 46
l.d $f10, 24($5) " 14 " 12 12 14 47
l.d $f12, 24($6) " " 15 13 13 15 48
mul.d $f10, $f10, $f12 " 45 " 14 15 45 49
add.d $f8, $f8, $f10 49 " " 15 45 49 50

With 100 loop iterations, fully unrolled, the last add will commit during cycle 434 (38 + 99*4).

Slow Add

Here the add.d execution time is increased from 4 cycles to 5 cycles.

Instruction $f8 $f10 $f12 Iss Exec Rdy Comm
0 0 0 - - - -
l.d $f10, 0($5) " 2 " 0 0 2 3
l.d $f12, 0($6) " " 3 1 1 3 4
mul.d $f10, $f10, $f12 " 23 " 2 3 23 24
add.d $f8, $f8, $f10 28 " " 3 23 28 29
l.d $f10, 8($5) " 6 " 4 4 6 30
l.d $f12, 8($6) " " 7 5 5 7 31
mul.d $f10, $f10, $f12 " 28 " 6 7 27 32
add.d $f8, $f8, $f10 33 " " 7 28 33 34
l.d $f10, 16($5) " 10 " 8 8 10 35
l.d $f12, 16($6) " " 11 9 9 11 36
mul.d $f10, $f10, $f12 " 33 " 10 11 31 37
add.d $f8, $f8, $f10 38 " " 11 33 38 39
l.d $f10, 24($5) " 14 " 12 12 14 40
l.d $f12, 24($6) " " 15 13 13 15 41
mul.d $f10, $f10, $f12 " 38 " 14 15 35 42
add.d $f8, $f8, $f10 43 " " 15 38 43 44

With 100 loop iterations, fully unrolled, the last add will commit during cycle 524 (29 + 99*5).

Summary

The three examples are summarized in the following table. The unrolled loop is iterated 100 times. The l.d, mul.d, and add.d columns are latencies, in cycles, for the operations. The Total Time column is the total number of cycles until the last instruction is committed.

Example l.d mul.d add.d Total Time
First Example 2 20 4 424
Slow Multiply 2 30 4 434
Slow Add 2 20 5 524

Note that increasing the multiply latency by 10 only increases the total loop time by 10 but increasing the add latency by 1 increases the total loop time by 100. The add latency has such a big impact because it is involved in a loop-carried dependence.

Analysis

Register renaming works extremely well on loop-unrolled code that does not contain dependence chains. These dependence chains arise from loop-carried dependences - true dependences from one iteration of the loop to another iteration. Long-running dependences are usually propagated through add operations.

Enhancements

The simplest register renaming architecture just issues one instruction per cycle. In the early 2000s major processors began issuing multiple instructions per cycle. In addition, they added multithreading capabilities.

Single Issue

Multiple Issue

Multithreading