We can understand data dependences by being more explicit about the action specified by an instruction. The action specified by an instruction is clearer if we describe instructions in terms of values rather than registers. We need to name the values in a way that captures changes in register contents over time.

We can capture the intent of a sequence of instructions by replacing register names in all operands with value names. To do this, we use a table that records the value names assigned to each register name. Then we apply the following algorithm.

1. Replace each source operand by the most recent value name in the designated register column.

2. Replace the destination operand by a new name and place the new name in the designated register column.

It is important that step 1 is done first. Otherwise, when the same register is used both as a source operand and a destination operand we are indicating that the instruction execution phase cannot begin until its result is ready. This makes it impossible for the execution phase to begin.

```      mul.d  \$f6, \$f0, \$f2
div.d  \$f4, \$f2, \$f0
```
Instruction \$f0 \$f2 \$f4 \$f6 Renamed Instruction
initial values V0 V1 V2 V3
mul.d \$f6, \$f0, \$f2 V4 mul.d V4, V0, V1
div.d \$f4, \$f2, \$f0 V5 div.d V5, V1, V0

It can be verified that we have correctly renamed these instructions but first we add the following instruction and examine the renaming process in detail.

```      sub.d  \$f2, \$f2, \$f6
```
1. Replace each source operand by the most recent value name in the designated register column.

2. Replace the destination operand by a new name and place the new name in the designated register column.

First replace \$f2 by the most recent value name in the \$f2 column. The table below shows the renamed operand in light blue. The values that it depends on are shown in light green.

Instruction \$f0 \$f2 \$f4 \$f6 Renamed Instruction
initial values V0 V1 V2 V3
mul.d \$f6, \$f0, \$f2 V4 mul.d V4, V0, V1
div.d \$f4, \$f2, \$f0 V5 div.d V5, V1, V0
sub.d \$f2, \$f2, \$f6 sub.d ??, V1, ??

Then replace \$f6 by the most recent value name in the \$f6 column.

Instruction \$f0 \$f2 \$f4 \$f6 Renamed Instruction
initial values V0 V1 V2 V3
mul.d \$f6, \$f0, \$f2 V4 mul.d V4, V0, V1
div.d \$f4, \$f2, \$f0 V5 div.d V5, V1, V0
sub.d \$f2, \$f2, \$f6 sub.d ??, V1, V4
1. Replace each source operand by the most recent value name in the designated register column.

2. Replace the destination operand by a new name and place the new name in the designated register column.

Replace the \$f2 destination operand by a new value name V7 and also put V7 into the column for \$f2.

Instruction \$f0 \$f2 \$f4 \$f6 Renamed Instruction
initial values V0 V1 V2 V3
mul.d \$f6, \$f0, \$f2 V4 mul.d V4, V0, V1
div.d \$f4, \$f2, \$f0 V5 div.d V5, V1, V0
sub.d \$f2, \$f2, \$f6 V7 sub.d V7, V1, V4

Note that the instructions with value names capture the intent of a sequence of instructions by specifying relationships between register values, rather than just registers. This simplifies determining when the execution of an instruction can begin. We only need to check if its source operand values have been computed. The name dependences no longer complicate the picture.

To determine when source operands have been computed, the value registers contain a status bit in addition to a data value. The status bit is initialized to 0 (not ready) when the value register is allocated for an instruction. It is set to 1 when a functional unit writes a result.

An instruction goes through 6 stages in its execution. Each stage can take more than one cycle, but recent processors have reduced the number of cycles per stage in order to reduce power consumption by pipeline registers.

• instruction fetch
The instruction is fetched from memory or an instruction cache.
• issue
The instruction is renamed.
• register fetch
The instruction waits, if necessary, in a reservation station until its source operands have valid status and a suitable functional unit is available.
• execution
When all source operands are ready and a suitable functional unit is available then execution begins. Different operations can have different latencies; that is, different elapsed times for their execution phase.
• write back
When the assigned functional unit completes its execution, it writes the destination operand value to the designated value register.
• commitment
The value register that was previously assigned to the instruction's destination architectural register can be freed for reuse by another instruction.

The instruction is fetched from memory or an instruction cache and the program counter is incremented.

When the instruction is issued it is assigned a value register number, which identifies the instruction and is used as a destination value name. Source operands architectural register numbers are also translated to value names. Issue circuitry forms a recoded instruction which contains a code for the operation and value names for each operand.

From the time they are issued until the time they are committed, instructions are considered to be "in flight". The destination value registers for in flight instructions are linked together in issue order to form an "in flight" list in the values circuitry.

If both source operands in the recoded instruction have "ready" status, and a functional unit is available to perform the operation, then the recoded instruction can be sent immediately to the functional unit. Otherwise, it is assigned to a reservation slot to wait for source operands to become available and/or for a suitable functional unit to become available.

When all source operands are ready and a suitable functional unit is available then execution begins. The functional unit receives an operation code and the destination value name from a reservation station slot. The source operands can either be retrieved by the reservation station or by the functional unit, depending on the implementation.

The destination operand value and its name are sent to the appropriate register in the values circuitry. The status bit for the named value register is updated to indicate that the value is ready.

If register forwarding is implemented the value and value name are also sent to reservation stations. A reservation station that is waiting for that value name can then capture the value. If other source operands are ready and a functional unit is available it instruction can start execution at the beginning of the next cycle. This is one cycle earlier than waiting for the value register to have a valid value.

The instruction begins commitment when two conditions are met:

• its destination value register is at the front of the "in flight" list, and
• its destination value register has "ready" status.

At that time, the value register that was previously assigned to the instruction's destination architectural register can be freed for reuse by another instruction.

The following block diagram and block descriptions are one of several possible implementations of register renaming. The description is not intended to represent how any particular processor implements register renaming. It is mainly intended to make implementation believable. For more information about design options see [Sima00].

• Instruction fetch circuitry - fetches instructions from memory or an instruction cache.
• Issue circuitry - renames instructions.
• Reservation circuitry - holds renamed instructions until they are ready to be executed.
• Execution circuitry - performs operations required for instructions.
• Values circuitry - manages a numbered set of value registers. The register numbers are used by issue circuitry for renaming operands.

In the literature the value registers are usually called rename registers.

• Commit circuitry - frees value registers that are no longer needed.

Modern processors that have a renaming architecture have an organization similar to that described in the previous sections. There are some important changes for improving performance.

• Speculative execution
• Super-scalar execution

Speculative execution involves predicting whether or not conditional branches will be taken and speculative issue of either the instruction at the branch target or the instruction following the branch, depending on the prediction. With speculative execution CPI can be reduced to nearly 1.0.

Speculative execution requires adding circuitry to the instruction fetch circuitry for keeping track of past history of branches. Usually a cache for recent target addresses is added to reduce the time spent on computing branch target addresses. Speculative execution also requires small modifications elsewhere to allow cancellation of instructions that were issued as a result of incorrect predictions.

Super-scalar execution involves issuing multiple instructions in each cycle. For the most part, this uses the same basic architectural components, with widening of paths from the instruction cache to support more instruction fetches and some addition to the issue circuitry to support parallel issue. Modern processors are currently capable of a CPI of issuing in batches of 4 or 5 instructions each cycle. It is difficult to increase the size of an issue batch beyond this point due to the inherent sequential nature of renaming.

Processors usually cannot issue a full batch of instructions on all cycles. Most implementations will not make any attempt to include instructions after a branch in the same issue batch. Instead, the instructions after the branch are issued the following cycle. This results in an average number of issues per cycle of about 2.5, giving a CPI of about 0.4.

Modern operating systems support time sharing between multiple threads, increasing the utilization of the processor. For example, a web server will typically have one thread that just listens for new connection requests. After establishing a connection, that thread passes it to a worker thread. Each worker thread handles only one connection at a time. If one of the worker threads is blocked, for example waiting for information from the file system, other threads can get time on the processor.

Recent processors have added hardware support for multiple threads. The most important hardware change is replication of the issue table - each hardware thread needs it own table. Any hardware fields that record architectural register numbers also have to be expanded to allow including a thread identifier.

Multi-threaded execution can be used to simplify and improve handling of branches. Suppose you had a processor that could issue 4 instructions per cycle and supported 6 threads. Then it has only 4 of the threads active at a time. If a branch instruction is fetched for one of the threads it can be issued, but then the thread made inactive until the branch condition is resolved. That allows the processor to keep issuing 4 instructions per cycle except when at least 3 threads are inactive due to branches.

It is not clear how useful this idea is. Currently, processor implementation is up against a power wall. Increasing the amount of work that gets done per cycle just increases the power consumption and the heat that results.

The first renaming architecture was used for floating point instructions in late high performance models of the IBM 360 family of of computers [Tomasulo]. At that time it was not called register renaming. Tomasulo's architecture did not have value registers. Instead it used reservation station numbers for naming values. Each reservation slot had space for two source operand values. The reservation stations had circuitry to update the source operands when functional units produced result values.

The term "register renaming" was used by compiler writers for a technique that they used for code optimization. The term was later adopted by computer engineers.

Register renaming was used in many processors starting in the mid 1990s. Today it is used in almost every desktop and server processor.