We can understand data dependences by being more explicit about the action specified by an instruction. The action specified by an instruction is clearer if we describe instructions in terms of values rather than registers. We need to name the values in a way that captures changes in register contents over time.

We can capture the intent of a sequence of instructions by replacing register names in all operands with value names. To do this, we use a table that records the value names assigned to each register name. Then we apply the following algorithm.

  1. Replace each source operand by the most recent value name in the designated register column.

  2. Replace the destination operand by a new name and place the new name in the designated register column.

It is important that step 1 is done first. Otherwise, when the same register is used both as a source operand and a destination operand we are indicating that the instruction execution phase cannot begin until its result is ready. This makes it impossible for the execution phase to begin.

We will start with the following instructions.

      mul.d  $f6, $f0, $f2
      div.d  $f4, $f2, $f0
      add.d  $f0, $f6, $f2
      
Instruction $f0 $f2 $f4 $f6 Renamed Instruction
initial values V0 V1 V2 V3
mul.d $f6, $f0, $f2 V4 mul.d V4, V0, V1
div.d $f4, $f2, $f0 V5 div.d V5, V1, V0
add.d $f0, $f6, $f2 V6 add.d V6, V4, V1

It can be verified that we have correctly renamed these instructions but first we add the following instruction and examine the renaming process in detail.

      sub.d  $f2, $f2, $f6
      
  1. Replace each source operand by the most recent value name in the designated register column.

  2. Replace the destination operand by a new name and place the new name in the designated register column.

First replace $f2 by the most recent value name in the $f2 column. The table below shows the renamed operand in light blue. The values that it depends on are shown in light green.

Instruction $f0 $f2 $f4 $f6 Renamed Instruction
initial values V0 V1 V2 V3
mul.d $f6, $f0, $f2 V4 mul.d V4, V0, V1
div.d $f4, $f2, $f0 V5 div.d V5, V1, V0
add.d $f0, $f6, $f2 V6 add.d V6, V4, V1
sub.d $f2, $f2, $f6 sub.d ??, V1, ??

Then replace $f6 by the most recent value name in the $f6 column.

Instruction $f0 $f2 $f4 $f6 Renamed Instruction
initial values V0 V1 V2 V3
mul.d $f6, $f0, $f2 V4 mul.d V4, V0, V1
div.d $f4, $f2, $f0 V5 div.d V5, V1, V0
add.d $f0, $f6, $f2 V6 add.d V6, V4, V1
sub.d $f2, $f2, $f6 sub.d ??, V1, V4
  1. Replace each source operand by the most recent value name in the designated register column.

  2. Replace the destination operand by a new name and place the new name in the designated register column.

Replace the $f2 destination operand by a new value name V7 and also put V7 into the column for $f2.

Instruction $f0 $f2 $f4 $f6 Renamed Instruction
initial values V0 V1 V2 V3
mul.d $f6, $f0, $f2 V4 mul.d V4, V0, V1
div.d $f4, $f2, $f0 V5 div.d V5, V1, V0
add.d $f0, $f6, $f2 V6 add.d V6, V4, V1
sub.d $f2, $f2, $f6 V7 sub.d V7, V1, V4

Note that the instructions with value names capture the intent of a sequence of instructions by specifying relationships between register values, rather than just registers. This simplifies determining when the execution of an instruction can begin. We only need to check if its source operand values have been computed. The name dependences no longer complicate the picture.

To determine when source operands have been computed, the value registers contain a status bit in addition to a data value. The status bit is initialized to 0 (not ready) when the value register is allocated for an instruction. It is set to 1 when a functional unit writes a result.

An instruction goes through 6 stages in its execution. Each stage can take more than one cycle, but recent processors have reduced the number of cycles per stage in order to reduce power consumption by pipeline registers.

The instruction is fetched from memory or an instruction cache and the program counter is incremented.

When the instruction is issued it is assigned a value register number, which identifies the instruction and is used as a destination value name. Source operands architectural register numbers are also translated to value names. Issue circuitry forms a recoded instruction which contains a code for the operation and value names for each operand.

From the time they are issued until the time they are committed, instructions are considered to be "in flight". The destination value registers for in flight instructions are linked together in issue order to form an "in flight" list in the values circuitry.

If both source operands in the recoded instruction have "ready" status, and a functional unit is available to perform the operation, then the recoded instruction can be sent immediately to the functional unit. Otherwise, it is assigned to a reservation slot to wait for source operands to become available and/or for a suitable functional unit to become available.

When all source operands are ready and a suitable functional unit is available then execution begins. The functional unit receives an operation code and the destination value name from a reservation station slot. The source operands can either be retrieved by the reservation station or by the functional unit, depending on the implementation.

The destination operand value and its name are sent to the appropriate register in the values circuitry. The status bit for the named value register is updated to indicate that the value is ready.

If register forwarding is implemented the value and value name are also sent to reservation stations. A reservation station that is waiting for that value name can then capture the value. If other source operands are ready and a functional unit is available it instruction can start execution at the beginning of the next cycle. This is one cycle earlier than waiting for the value register to have a valid value.

The instruction begins commitment when two conditions are met:

At that time, the value register that was previously assigned to the instruction's destination architectural register can be freed for reuse by another instruction.

The following block diagram and block descriptions are one of several possible implementations of register renaming. The description is not intended to represent how any particular processor implements register renaming. It is mainly intended to make implementation believable. For more information about design options see [Sima00].

Instruction fetch circuitry fetches instructions from memory or an instruction cache, incrementing the program counter after each fetch. Instruction fetch circuitry often also has a branch history table for predicting whether or not conditional branches will be taken and a branch target buffer for quick lookup of branch targets.

Issue circuitry renames instruction operands, sending the recoded instruction to the reservation stations. It uses an issue table - a small set of registers, one for each architectural register. This table is essentially hardware for the register columns in the earlier textual renaming tables. The values in the table are value register numbers obtained from the values circuitry.

In some variants of the renaming architecture, issue circuitry may also fetch source values if they are available.

Reservation circuitry holds recoded instructions until they are ready to be executed. An instruction is ready when all of its source operands are valid and there is a suitable free functional unit to perform the operation.

Execution circuitry performs operations required for instructions. Execution circuitry is usually grouped into functional units, each performing a group of operations that involve shared circuity. Most processors, for example, use a load-store functional unit for all memory load and store instructions. For high performance, the functional units are often pipelined to permit starting the execution phase of a new instruction every cycle.

Values circuitry manages a numbered set of value registers. The register numbers are used by issue circuitry for renaming operands. In addition to its value, a value register has a status bit that indicates whether or not its value is valid, a link field for building lists, and a field indicating which architectural register it represents.

Values circuitry uses the link field to build to the following lists.

Values circuitry also has small registers for keeping track of the start and end of these lists. The start of the free list is a value register number that can be passed to the issue circuitry for naming new values. The start of the in-flight list is the register number for the value register that is to be committed next. When necessary, value registers can be added to the end of these lists.

Commit circuitry frees value registers that are no longer needed. A value register is no longer needed when its architectural register gets assigned a new value. Commit circuitry can be considered a part of the values circuitry since it has access to the free and in-flight lists. In addition, commit circuitry maintains a commit table that maps architectural registers to the value registers that contain their most recently committed values. This table is similar to the issue table in the issue circuitry.

Commit circuitry works with the value register at the head of the in-flight list. The "architectural register" field of that register is used to select an entry from the commit table. This entry specifies the number of the value register that was committed earlier to the architectural register. That value is no longer needed so its register is moved into the free list. Meanwhile the value register at the head head of the in-flight list is removed from the in-flight list and its value register number replaces the old entry in the commit table.

The operation of commit circuitry can be improved by maintaining separate lists for each architectural register. That allows multiple instructions to be committed per cycle. However, it only works correctly if reservation stations capture source operand values as soon as they are ready. The importance of multiple commits per cycle is brought out in Register Renaming Performance.

Modern processors that have a renaming architecture have an organization similar to that described in the previous sections. There are some important changes for improving performance.

Speculative execution involves predicting whether or not conditional branches will be taken and speculative issue of either the instruction at the branch target or the instruction following the branch, depending on the prediction. With speculative execution CPI can be reduced to nearly 1.0.

Speculative execution requires adding circuitry to the instruction fetch circuitry for keeping track of past history of branches. Usually a cache for recent target addresses is added to reduce the time spent on computing branch target addresses. Speculative execution also requires small modifications elsewhere to allow cancellation of instructions that were issued as a result of incorrect predictions.

Super-scalar execution involves issuing multiple instructions in each cycle. For the most part, this uses the same basic architectural components, with widening of paths from the instruction cache to support more instruction fetches and some addition to the issue circuitry to support parallel issue. Modern processors are currently capable of a CPI of issuing in batches of 4 or 5 instructions each cycle. It is difficult to increase the size of an issue batch beyond this point due to the inherent sequential nature of renaming.

Processors usually cannot issue a full batch of instructions on all cycles. Most implementations will not make any attempt to include instructions after a branch in the same issue batch. Instead, the instructions after the branch are issued the following cycle. This results in an average number of issues per cycle of about 2.5, giving a CPI of about 0.4.

Modern operating systems support time sharing between multiple threads, increasing the utilization of the processor. For example, a web server will typically have one thread that just listens for new connection requests. After establishing a connection, that thread passes it to a worker thread. Each worker thread handles only one connection at a time. If one of the worker threads is blocked, for example waiting for information from the file system, other threads can get time on the processor.

Recent processors have added hardware support for multiple threads. The most important hardware change is replication of the issue table - each hardware thread needs it own table. Any hardware fields that record architectural register numbers also have to be expanded to allow including a thread identifier.

Multi-threaded execution can be used to simplify and improve handling of branches. Suppose you had a processor that could issue 4 instructions per cycle and supported 6 threads. Then it has only 4 of the threads active at a time. If a branch instruction is fetched for one of the threads it can be issued, but then the thread made inactive until the branch condition is resolved. That allows the processor to keep issuing 4 instructions per cycle except when at least 3 threads are inactive due to branches.

It is not clear how useful this idea is. Currently, processor implementation is up against a power wall. Increasing the amount of work that gets done per cycle just increases the power consumption and the heat that results.

The first renaming architecture was used for floating point instructions in late high performance models of the IBM 360 family of of computers [Tomasulo]. At that time it was not called register renaming. Tomasulo's architecture did not have value registers. Instead it used reservation station numbers for naming values. Each reservation slot had space for two source operand values. The reservation stations had circuitry to update the source operands when functional units produced result values.

The term "register renaming" was used by compiler writers for a technique that they used for code optimization. The term was later adopted by computer engineers.

Register renaming was used in many processors starting in the mid 1990s. Today it is used in almost every desktop and server processor.