The "Instruction" column in the table below shows part of the code from the unrolled dot product loop in the Loop Unrolling web page. This loop code will be used in our first examples.
The "$f8", "$f10", and "$f12" columns show times that the values in the indicated registers are ready.
The "Iss" column shows when an instruction is issued in a renaming architecture.
The "Exec" column shows when an instruction begins its execution phase.
The "Rdy" column shows when the instruction completes its
execution phase and its destination operand value is ready.
For this example, it is assumed that the execution phase requires
2 cycles for l.d
, 4 cycles for add.d
,
and 20 cycles for mul.d
.
The "Comm" column shows when an instruction is committed.
Instruction | $f8 | $f10 | $f12 | Iss | Exec | Rdy | Comm |
---|---|---|---|---|---|---|---|
0 | 0 | 0 | - | - | - | - | |
l.d $f10, 0($5) | " | 2 | " | 0 | 0 | 2 | 3 |
l.d $f12, 0($6) | " | " | 3 | 1 | 1 | 3 | 4 |
mul.d $f10, $f10, $f12 | " | 23 | " | 2 | 3 | 23 | 24 |
add.d $f8, $f8, $f10 | 27 | " | " | 3 | 23 | 27 | 28 |
Other parts of this web page describe how the columns are calculated and provide some interesting examples. Later these examples are analyzed and the analysis is qualitatively extended to enhancements of the renaming architecture.
Entries in the table are calculated row-by-row. In each row the issue and execution times are computed first. Then the ready and execution times are computed. Finally the commit times are computed.
The "Iss" column shows when an instruction is issued in a renaming architecture. An instruction is issued every cycle as long as issue circuitry can get a value name (rename register number) from the values circuitry. Values circuitry can provide a value name if the number of in-flight instructions plus the number of architectural registers does not exceed the number of rename registers. We will assume that does not happen. The validity of that assumption is discussed in the "Analysis" section.
The "Exec" column shows when an instruction begins its execution phase. An instruction begins its execution phase after it is issued, when both of its source operand values are ready and an appropriate functional unit is available to perform the operation. We will make an idealistic assumption that there is always an available functional unit. Then the "Exec" time is the larger of the following:
The register times can be found by searching up in the columns for the source registers, which are $f10 and $f12 in the example below. This diagram shows a computed "Exec" time in blue and the values that it depends on shown in green.
Instruction | $f8 | $f10 | $f12 | Iss | Exec | Rdy | Comm |
---|---|---|---|---|---|---|---|
0 | 0 | 0 | - | - | - | - | |
l.d $f10, 0($5) | " | 2 | " | 0 | 0 | 2 | 3 |
l.d $f12, 0($6) | " | " | 3 | 1 | 1 | 3 | 4 |
mul.d $f10, $f10, $f12 | ? | ? | ? | 2 | 3 | ? | ? |
The "Rdy" column shows when the instruction completes its execution phase and its destination operand value is ready. For this example, it is assumed that the execution phase requires the following number of cycles.
l.d | 2 |
add.d | 4 |
mul.d | 20 |
The "Rdy" time for an instruction is just the sum of its "Exec" time and the appropriate value from the above table.
The "$f8", "$f10", "$f12", and "$f10" columns show times that the values in the indicated registers is ready. These values are left unchanged except for the destination operand of the instruction. Its value will be ready when the instruction completes it execution phase - the same as the time in the "Rdy" column.
The diagram below shows a computed "Rdy" time and a computed register time ($12) in blue. The values that they depend on are shown in green.
Instruction | $f8 | $f10 | $f12 | Iss | Exec | Rdy | Comm |
---|---|---|---|---|---|---|---|
0 | 0 | 0 | - | - | - | - | |
l.d $f10, 0($5) | " | 2 | " | 0 | 0 | 2 | 3 |
l.d $f12, 0($6) | " | " | 3 | 1 | 1 | 3 | 4 |
mul.d $f10, $f10, $f12 | " | 23 | " | 2 | 3 | 23 | ? |
The "Comm" column shows when an instruction is committed. An instruction is committed when it is at the head of the "in flight" list and its destination operand value is "ready". To find the commit time for an instruction, take the larger of the two following values:
Adding 1 to the "Rdy" time reflects the fact that commit begins at least one cycle after execution completes. Adding 1 to the "Comm" time assumes that the commit circuitry can only commit one instruction per cycle. The validity of this assumption is discussed in the "Conclusions" section.
The diagram below show a computed "Comm" time in blue. The values that it depends on are shown in green.
Instruction | $f8 | $f10 | $f12 | Iss | Exec | Rdy | Comm |
---|---|---|---|---|---|---|---|
0 | 0 | 0 | - | - | - | - | |
l.d $f10, 0($5) | " | 2 | " | 0 | 0 | 2 | 3 |
l.d $f12, 0($6) | " | " | 3 | 1 | 1 | 3 | 4 |
mul.d $f10, $f10, $f12 | " | 23 | " | 2 | 3 | 23 | 24 |
Here is a later example showing an instruction that is ready quite early, but must wait for the previous instruction to commit.
Instruction | $f8 | $f10 | $f12 | Iss | Exec | Rdy | Comm |
---|---|---|---|---|---|---|---|
0 | 0 | 0 | - | - | - | - | |
l.d $f10, 0($5) | " | 2 | " | 0 | 0 | 2 | 3 |
l.d $f12, 0($6) | " | " | 3 | 1 | 1 | 3 | 4 |
mul.d $f10, $f10, $f12 | " | 23 | " | 2 | 3 | 23 | 24 |
add.d $f8, $f8, $f10 | 27 | " | " | 3 | 23 | 27 | 28 |
l.d $f10, 8($5) | " | 6 | " | 4 | 4 | 6 | 29 |
The table below shows four full iterations from the unrolled dot product loop.
Instruction | $f8 | $f10 | $f12 | Iss | Exec | Rdy | Comm |
---|---|---|---|---|---|---|---|
0 | 0 | 0 | - | - | - | - | |
l.d $f10, 0($5) | " | 2 | " | 0 | 0 | 2 | 3 |
l.d $f12, 0($6) | " | " | 3 | 1 | 1 | 3 | 4 |
mul.d $f10, $f10, $f12 | " | 23 | " | 2 | 3 | 23 | 24 |
add.d $f8, $f8, $f10 | 27 | " | " | 3 | 23 | 27 | 28 |
l.d $f10, 8($5) | " | 6 | " | 4 | 4 | 6 | 29 |
l.d $f12, 8($6) | " | " | 7 | 5 | 5 | 7 | 30 |
mul.d $f10, $f10, $f12 | " | 27 | " | 6 | 7 | 27 | 31 |
add.d $f8, $f8, $f10 | 31 | " | " | 7 | 27 | 31 | 32 |
l.d $f10, 16($5) | " | 10 | " | 8 | 8 | 10 | 33 |
l.d $f12, 16($6) | " | " | 11 | 9 | 9 | 11 | 34 |
mul.d $f10, $f10, $f12 | " | 31 | " | 10 | 11 | 31 | 35 |
add.d $f8, $f8, $f10 | 35 | " | " | 11 | 31 | 35 | 36 |
l.d $f10, 24($5) | " | 14 | " | 12 | 12 | 14 | 37 |
l.d $f12, 24($6) | " | " | 15 | 13 | 13 | 15 | 38 |
mul.d $f10, $f10, $f12 | " | 35 | " | 14 | 15 | 35 | 39 |
add.d $f8, $f8, $f10 | 39 | " | " | 15 | 35 | 39 | 40 |
With 100 loop iterations, fully unrolled, the last add will commit during cycle 424 (28 + 99*4).
A register renaming simulator is available
here.
This zip file can be unzipped to produce a directory named
renaming-simulation
.
This directory contains a jar file named Renaming.jar
.
Click on this file to run the simulation.
The simulation demonstrates how register renaming performs on the three completely unrolled loops in the Loop Unrolling web page. It has controls for varying instruction latencies and the maximum numbers of instruction issues and commits per cycle.
Here we have two more examples which use the same dot product code as in the original example but modify the operation latencies. The first has a slow multiply. The second has a slow add. The latencies are summarized in the following table.
Example | l.d | mul.d | add.d |
---|---|---|---|
First Example | 2 | 20 | 4 |
Slow Multiply | 2 | 30 | 4 |
Slow Add | 2 | 20 | 5 |
Here the mul.d
execution time is increased from 20 cycles
to 30 cycles.
Instruction | $f8 | $f10 | $f12 | Iss | Exec | Rdy | Comm |
---|---|---|---|---|---|---|---|
0 | 0 | 0 | - | - | - | - | |
l.d $f10, 0($5) | " | 2 | " | 0 | 0 | 2 | 3 |
l.d $f12, 0($6) | " | " | 3 | 1 | 1 | 3 | 4 |
mul.d $f10, $f10, $f12 | " | 33 | " | 2 | 3 | 33 | 34 |
add.d $f8, $f8, $f10 | 37 | " | " | 3 | 33 | 37 | 38 |
l.d $f10, 8($5) | " | 6 | " | 4 | 4 | 6 | 39 |
l.d $f12, 8($6) | " | " | 7 | 5 | 5 | 7 | 40 |
mul.d $f10, $f10, $f12 | " | 37 | " | 6 | 7 | 37 | 41 |
add.d $f8, $f8, $f10 | 41 | " | " | 7 | 37 | 41 | 42 |
l.d $f10, 16($5) | " | 10 | " | 8 | 8 | 10 | 43 |
l.d $f12, 16($6) | " | " | 11 | 9 | 9 | 11 | 44 |
mul.d $f10, $f10, $f12 | " | 41 | " | 10 | 11 | 41 | 45 |
add.d $f8, $f8, $f10 | 45 | " | " | 11 | 41 | 45 | 46 |
l.d $f10, 24($5) | " | 14 | " | 12 | 12 | 14 | 47 |
l.d $f12, 24($6) | " | " | 15 | 13 | 13 | 15 | 48 |
mul.d $f10, $f10, $f12 | " | 45 | " | 14 | 15 | 45 | 49 |
add.d $f8, $f8, $f10 | 49 | " | " | 15 | 45 | 49 | 50 |
With 100 loop iterations, fully unrolled, the last add will commit during cycle 434 (38 + 99*4).
Here the add.d
execution time is increased from 4 cycles
to 5 cycles.
Instruction | $f8 | $f10 | $f12 | Iss | Exec | Rdy | Comm |
---|---|---|---|---|---|---|---|
0 | 0 | 0 | - | - | - | - | |
l.d $f10, 0($5) | " | 2 | " | 0 | 0 | 2 | 3 |
l.d $f12, 0($6) | " | " | 3 | 1 | 1 | 3 | 4 |
mul.d $f10, $f10, $f12 | " | 23 | " | 2 | 3 | 23 | 24 |
add.d $f8, $f8, $f10 | 28 | " | " | 3 | 23 | 28 | 29 |
l.d $f10, 8($5) | " | 6 | " | 4 | 4 | 6 | 30 |
l.d $f12, 8($6) | " | " | 7 | 5 | 5 | 7 | 31 |
mul.d $f10, $f10, $f12 | " | 28 | " | 6 | 7 | 27 | 32 |
add.d $f8, $f8, $f10 | 33 | " | " | 7 | 28 | 33 | 34 |
l.d $f10, 16($5) | " | 10 | " | 8 | 8 | 10 | 35 |
l.d $f12, 16($6) | " | " | 11 | 9 | 9 | 11 | 36 |
mul.d $f10, $f10, $f12 | " | 33 | " | 10 | 11 | 31 | 37 |
add.d $f8, $f8, $f10 | 38 | " | " | 11 | 33 | 38 | 39 |
l.d $f10, 24($5) | " | 14 | " | 12 | 12 | 14 | 40 |
l.d $f12, 24($6) | " | " | 15 | 13 | 13 | 15 | 41 |
mul.d $f10, $f10, $f12 | " | 38 | " | 14 | 15 | 35 | 42 |
add.d $f8, $f8, $f10 | 43 | " | " | 15 | 38 | 43 | 44 |
With 100 loop iterations, fully unrolled, the last add will commit during cycle 524 (29 + 99*5).
The three examples are summarized in the following table. The unrolled loop is iterated 100 times. The l.d, mul.d, and add.d columns are latencies, in cycles, for the operations. The Total Time column is the total number of cycles until the last instruction is committed.
Example | l.d | mul.d | add.d | Total Time |
---|---|---|---|---|
First Example | 2 | 20 | 4 | 424 |
Slow Multiply | 2 | 30 | 4 | 434 |
Slow Add | 2 | 20 | 5 | 524 |
Note that increasing the multiply latency by 10 only increases the total loop time by 10 but increasing the add latency by 1 increases the total loop time by 100. The add latency has such a big impact because it is involved in a loop-carried dependence.
Register renaming works extremely well on loop-unrolled code that does not contain dependence chains. These dependence chains arise from loop-carried dependences - true dependences from one iteration of the loop to another iteration. Long-running dependences are usually propagated through add operations.
The simplest register renaming architecture just issues one instruction per cycle. In the early 2000s major processors began issuing multiple instructions per cycle. In addition, they added multithreading capabilities.