Register Renaming Performance

The "Instruction" column in the table below shows part of the code from the unrolled dot product loop in the Loop Unrolling web page. This loop code will be used in our first examples.

The "$f8", "$f10", and "$f12" columns show times that the values in the indicated registers are ready.

The "Iss" column shows when an instruction is issued in a renaming architecture.

The "Exec" column shows when an instruction begins its execution phase.

The "Rdy" column shows when the instruction completes its execution phase and its destination operand value is ready. For this example, it is assumed that the execution phase requires 2 cycles for l.d, 4 cycles for add.d, and 20 cycles for mul.d.

The "Comm" column shows when an instruction is committed.

Instruction	$f8	$f10	$f12	Iss	Exec	Rdy	Comm
	0	0	0	-	-	-	-
l.d $f10, 0($5)	"	2	"	0	0	2	3
l.d $f12, 0($6)	"	"	3	1	1	3	4
mul.d $f10, $f10, $f12	"	23	"	2	3	23	24
add.d $f8, $f8, $f10	27	"	"	3	23	27	28

Other parts of this web page describe how the columns are calculated and provide some interesting examples. Later these examples are analyzed and the analysis is qualitatively extended to enhancements of the renaming architecture.

Calculating Issue and Execution Times

The "Iss" column shows when an instruction is issued in a renaming architecture. An instruction is issued every cycle as long as issue circuitry can get a value name (rename register number) from the values circuitry. Values circuitry can provide a value name if the number of in-flight instructions plus the number of architectural registers does not exceed the number of rename registers. We will assume that does not happen. The validity of that assumption is discussed in the "Analysis" section.

The "Exec" column shows when an instruction begins its execution phase. An instruction begins its execution phase after it is issued, when both of its source operand values are ready and an appropriate functional unit is available to perform the operation. We will make an idealistic assumption that there is always an available functional unit. Then the "Exec" time is the larger of the following:

the "Iss" time for the instruction, and
the register times for the instruction's two source operands.

The register times can be found by searching up in the columns for the source registers, which are $f10 and $f12 in the example below. This diagram shows a computed "Exec" time in blue and the values that it depends on shown in green.

Instruction	$f8	$f10	$f12	Iss	Exec	Rdy	Comm
	0	0	0	-	-	-	-
l.d $f10, 0($5)	"	2	"	0	0	2	3
l.d $f12, 0($6)	"	"	3	1	1	3	4
mul.d $f10, $f10, $f12	?	?	?	2	3	?	?

Calculating Rdy and Register Times

l.d	2
add.d	4
mul.d	20

The "Rdy" time for an instruction is just the sum of its "Exec" time and the appropriate value from the above table.

The "$f8", "$f10", "$f12", and "$f10" columns show times that the values in the indicated registers is ready. These values are left unchanged except for the destination operand of the instruction. Its value will be ready when the instruction completes it execution phase - the same as the time in the "Rdy" column.

The diagram below shows a computed "Rdy" time and a computed register time ($12) in blue. The values that they depend on are shown in green.

Instruction	$f8	$f10	$f12	Iss	Exec	Rdy	Comm
	0	0	0	-	-	-	-
l.d $f10, 0($5)	"	2	"	0	0	2	3
l.d $f12, 0($6)	"	"	3	1	1	3	4
mul.d $f10, $f10, $f12	"	23	"	2	3	23	?

Calculating Commit Times I

The "Comm" column shows when an instruction is committed. An instruction is committed when it is at the head of the "in flight" list and its destination operand value is "ready". To find the commit time for an instruction, take the larger of the two following values:

the "Rdy" time for the instruction plus 1, and
the "Comm" time for the previous instruction plus 1.

Adding 1 to the "Rdy" time reflects the fact that commit begins at least one cycle after execution completes. Adding 1 to the "Comm" time assumes that the commit circuitry can only commit one instruction per cycle. The validity of this assumption is discussed in the "Conclusions" section.

The diagram below show a computed "Comm" time in blue. The values that it depends on are shown in green.

Instruction	$f8	$f10	$f12	Iss	Exec	Rdy	Comm
	0	0	0	-	-	-	-
l.d $f10, 0($5)	"	2	"	0	0	2	3
l.d $f12, 0($6)	"	"	3	1	1	3	4
mul.d $f10, $f10, $f12	"	23	"	2	3	23	24

Calculating Commit Times II

Here is a later example showing an instruction that is ready quite early, but must wait for the previous instruction to commit.

Instruction	$f8	$f10	$f12	Iss	Exec	Rdy	Comm
	0	0	0	-	-	-	-
l.d $f10, 0($5)	"	2	"	0	0	2	3
l.d $f12, 0($6)	"	"	3	1	1	3	4
mul.d $f10, $f10, $f12	"	23	"	2	3	23	24
add.d $f8, $f8, $f10	27	"	"	3	23	27	28
l.d $f10, 8($5)	"	6	"	4	4	6	29

More of the Table

The table below shows four full iterations from the unrolled dot product loop.

Instruction	$f8	$f10	$f12	Iss	Exec	Rdy	Comm
	0	0	0	-	-	-	-
l.d $f10, 0($5)	"	2	"	0	0	2	3
l.d $f12, 0($6)	"	"	3	1	1	3	4
mul.d $f10, $f10, $f12	"	23	"	2	3	23	24
add.d $f8, $f8, $f10	27	"	"	3	23	27	28
l.d $f10, 8($5)	"	6	"	4	4	6	29
l.d $f12, 8($6)	"	"	7	5	5	7	30
mul.d $f10, $f10, $f12	"	27	"	6	7	27	31
add.d $f8, $f8, $f10	31	"	"	7	27	31	32
l.d $f10, 16($5)	"	10	"	8	8	10	33
l.d $f12, 16($6)	"	"	11	9	9	11	34
mul.d $f10, $f10, $f12	"	31	"	10	11	31	35
add.d $f8, $f8, $f10	35	"	"	11	31	35	36
l.d $f10, 24($5)	"	14	"	12	12	14	37
l.d $f12, 24($6)	"	"	15	13	13	15	38
mul.d $f10, $f10, $f12	"	35	"	14	15	35	39
add.d $f8, $f8, $f10	39	"	"	15	35	39	40

With 100 loop iterations, fully unrolled, the last add will commit during cycle 424 (28 + 99*4).

More Examples

A register renaming simulator is available here. This zip file can be unzipped to produce a directory named renaming-simulation. This directory contains a jar file named Renaming.jar. Click on this file to run the simulation.

The simulation demonstrates how register renaming performs on the three completely unrolled loops in the Loop Unrolling web page. It has controls for varying instruction latencies and the maximum numbers of instruction issues and commits per cycle.

Here we have two more examples which use the same dot product code as in the original example but modify the operation latencies. The first has a slow multiply. The second has a slow add. The latencies are summarized in the following table.

Example	l.d	mul.d	add.d
First Example	2	20	4
Slow Multiply	2	30	4
Slow Add	2	20	5

Slow Multiply

Here the mul.d execution time is increased from 20 cycles to 30 cycles.

Instruction	$f8	$f10	$f12	Iss	Exec	Rdy	Comm
	0	0	0	-	-	-	-
l.d $f10, 0($5)	"	2	"	0	0	2	3
l.d $f12, 0($6)	"	"	3	1	1	3	4
mul.d $f10, $f10, $f12	"	33	"	2	3	33	34
add.d $f8, $f8, $f10	37	"	"	3	33	37	38
l.d $f10, 8($5)	"	6	"	4	4	6	39
l.d $f12, 8($6)	"	"	7	5	5	7	40
mul.d $f10, $f10, $f12	"	37	"	6	7	37	41
add.d $f8, $f8, $f10	41	"	"	7	37	41	42
l.d $f10, 16($5)	"	10	"	8	8	10	43
l.d $f12, 16($6)	"	"	11	9	9	11	44
mul.d $f10, $f10, $f12	"	41	"	10	11	41	45
add.d $f8, $f8, $f10	45	"	"	11	41	45	46
l.d $f10, 24($5)	"	14	"	12	12	14	47
l.d $f12, 24($6)	"	"	15	13	13	15	48
mul.d $f10, $f10, $f12	"	45	"	14	15	45	49
add.d $f8, $f8, $f10	49	"	"	15	45	49	50

With 100 loop iterations, fully unrolled, the last add will commit during cycle 434 (38 + 99*4).

Slow Add

Here the add.d execution time is increased from 4 cycles to 5 cycles.

Instruction	$f8	$f10	$f12	Iss	Exec	Rdy	Comm
	0	0	0	-	-	-	-
l.d $f10, 0($5)	"	2	"	0	0	2	3
l.d $f12, 0($6)	"	"	3	1	1	3	4
mul.d $f10, $f10, $f12	"	23	"	2	3	23	24
add.d $f8, $f8, $f10	28	"	"	3	23	28	29
l.d $f10, 8($5)	"	6	"	4	4	6	30
l.d $f12, 8($6)	"	"	7	5	5	7	31
mul.d $f10, $f10, $f12	"	28	"	6	7	27	32
add.d $f8, $f8, $f10	33	"	"	7	28	33	34
l.d $f10, 16($5)	"	10	"	8	8	10	35
l.d $f12, 16($6)	"	"	11	9	9	11	36
mul.d $f10, $f10, $f12	"	33	"	10	11	31	37
add.d $f8, $f8, $f10	38	"	"	11	33	38	39
l.d $f10, 24($5)	"	14	"	12	12	14	40
l.d $f12, 24($6)	"	"	15	13	13	15	41
mul.d $f10, $f10, $f12	"	38	"	14	15	35	42
add.d $f8, $f8, $f10	43	"	"	15	38	43	44

With 100 loop iterations, fully unrolled, the last add will commit during cycle 524 (29 + 99*5).

Summary

The three examples are summarized in the following table. The unrolled loop is iterated 100 times. The l.d, mul.d, and add.d columns are latencies, in cycles, for the operations. The Total Time column is the total number of cycles until the last instruction is committed.

Example	l.d	mul.d	add.d	Total Time
First Example	2	20	4	424
Slow Multiply	2	30	4	434
Slow Add	2	20	5	524

Note that increasing the multiply latency by 10 only increases the total loop time by 10 but increasing the add latency by 1 increases the total loop time by 100. The add latency has such a big impact because it is involved in a loop-carried dependence.

Single Issue

If control hazards and RAW data hazards are not severe, a register renaming architecture can achieve a CPI near 1.0. This is largely due to elimination of name dependences (WAW and WAR) and having reservation stations that hold enough tasks to keep execution circuitry busy.
The primary limitation of register renaming arises when dependences form chains. Then the latency of the instruction that carries the dependence is critical. The latencies of other instructions have much less impact on performance. Add instructions are the most common carriers of looped dependences.

Multiple Issue

Superscalar execution (multiple issues per cycle) does not help at all when CPI is limited by loop-carried dependences. The latency of the instruction that carries the dependence determines the minimum time for completing an iteration.
Fortunately, the row operation loop does not have any loop-carried dependences. This is one of the most useful loops in scientific programs.

Register Renaming Performance

Register Renaming Performance

Calculations

Calculations

Calculating Issue and Execution Times

Calculating Rdy and Register Times

Calculating Commit Times I

Calculating Commit Times II

More of the Table

More Examples

Dot Product

Slow Multiply

Slow Add

Summary

Analysis

Enhancements

Enhancements

Single Issue

Multiple Issue

Multithreading