The basic concept of pipelining is to break up instruction execution activities into stages that can operate independently. Every instruction passes through the same stages much like an assembly line.

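For example, we could set up the following five stages for a MIPS pipeline: instruction fetch (IF), instruction decode and register read (ID), execute (EX), memory access (MEM), and write back (WB).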

With these pipeline stages, a sequence of instructions can be executed as shown below. Time progresses from left to right. Each horizontal division represents one clock period.

l.s $f0, 0($t1)      IF  ID  EX  MEM WB
l.s $f2, 0($t2)          IF  ID  EX  MEM WB
mul.s $f4, $f0, $f2          IF  ID  EX  MEM WB
add.s $f6, $f6, $f4              IF  ID  EX  MEM WB
addi $t1, $t1, 4                     IF  ID  EX  MEM WB
addi $t2, $t2, 4                         IF  ID  EX  MEM WB
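
The idealized schedule described above can be generated mechanically: instruction i enters stage s in cycle i + s. The following Python sketch illustrates this under the assumption that there are no stalls at all (it ignores, for instance, the fact that mul.s needs the values loaded by the two l.s instructions). The function names ideal_schedule and format_schedule are hypothetical and chosen only for this illustration.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def ideal_schedule(instructions, stages=STAGES):
    """Return one row per instruction.

    row[c] is the stage the instruction occupies in cycle c (0-based),
    or an empty string if it is not in the pipeline during that cycle.
    Assumes an ideal pipeline: one instruction issued per cycle, no stalls.
    """
    total_cycles = len(instructions) + len(stages) - 1
    rows = []
    for i, _ in enumerate(instructions):
        row = [""] * total_cycles
        for s, stage in enumerate(stages):
            row[i + s] = stage  # instruction i is in stage s during cycle i + s
        rows.append(row)
    return rows


def format_schedule(instructions, rows):
    """Render the schedule as text, one 4-character column per clock cycle."""
    width = max(len(text) for text in instructions)
    lines = []
    for text, row in zip(instructions, rows):
        cells = "".join(f"{cell:<4}" for cell in row)
        lines.append(f"{text:<{width}}  {cells}".rstrip())
    return "\n".join(lines)


program = [
    "l.s $f0, 0($t1)",
    "l.s $f2, 0($t2)",
    "mul.s $f4, $f0, $f2",
    "add.s $f6, $f6, $f4",
    "addi $t1, $t1, 4",
    "addi $t2, $t2, 4",
]

rows = ideal_schedule(program)
print(format_schedule(program, rows))
```

With six instructions and five stages the schedule spans 6 + 5 - 1 = 10 cycles, and each row is shifted one cycle to the right of the row above it.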

As you can see from the figures below, pipelining increases instruction throughput. Notice that after the 5th cycle, the unpipelined execution completes only one instruction every 5 cycles, while the idealized pipelined execution completes one instruction every cycle, that is, 5 instructions in the same 5 cycles.

Ideally, instruction throughput is increased to 1 instruction per clock. In other words, the clocks per instruction (CPI) factor in the performance equation is reduced from 5.0 to 1.0.

Unpipelined Execution
instr1   IF  ID  EX  MEM WB
instr2                       IF  ID  EX  MEM WB
instr3                                           IF  ID  EX  MEM WB

Idealized Pipelined Execution
instr1   IF  ID  EX  MEM WB
instr2       IF  ID  EX  MEM WB
instr3           IF  ID  EX  MEM WB
instr4               IF  ID  EX  MEM WB
instr5                   IF  ID  EX  MEM WB
instr6                       IF  ID  EX  MEM WB
instr7                           IF  ID  EX  MEM WB
instr8                               IF  ID  EX  MEM WB
instr9                                   IF  ID  EX  MEM WB
instr10                                      IF  ID  EX  MEM WB
instr11                                          IF  ID  EX  MEM WB
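
The cycle counts in the two figures above follow from simple arithmetic. With k stages and n instructions, unpipelined execution takes n * k cycles, while an ideal pipeline takes k cycles to fill and then completes one instruction per cycle, for k + (n - 1) cycles in total. The Python sketch below checks this against the figures; the function names are hypothetical, and the model assumes no stalls of any kind.

```python
def unpipelined_cycles(n, k=5):
    """Cycles to run n instructions when each one uses all k stages alone."""
    return n * k


def pipelined_cycles(n, k=5):
    """Cycles to run n instructions in an ideal k-stage pipeline (no stalls)."""
    if n == 0:
        return 0
    return k + (n - 1)  # k cycles for the first instruction, 1 more for each later one


def cpi(cycles, n):
    """Average clocks per instruction."""
    return cycles / n


# The figures: 3 unpipelined instructions and 11 pipelined instructions
# both occupy 15 cycles.
print(unpipelined_cycles(3))   # 15
print(pipelined_cycles(11))    # 15

# CPI approaches 1.0 as the instruction count grows.
for n in (11, 1000, 1_000_000):
    print(n, cpi(pipelined_cycles(n), n))
```

The fill time of k - 1 cycles is why the CPI of 1.0 is an ideal that is only approached for long instruction sequences.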
instr11 IF ID EX MEM WB

The best starting point for a pipelined implementation is a single-cycle implementation. For example, for a MIPS pipeline you could start with an implementation whose high-level data path is given in Patterson and Hennessy, Figure 4.33.

To implement pipelining, registers are added between stages as in Patterson and Hennessy, Figure 4.35. These registers hold data and control signals that are produced in an early stage for use in later stages. A stage cannot hold a signal for more than one cycle, because in the next cycle it is working on a different instruction. A signal that is generated in an early stage and used several stages later must therefore pass through all of the intermediate pipeline registers. For example, a control signal that is produced in the ID stage and used in the WB stage must pass through 3 pipeline registers: the ID/EX registers, the EX/MEM registers, and the MEM/WB registers. See Patterson and Hennessy, Figure 4.51.
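
The way a control signal travels from ID to WB can be illustrated with a small Python sketch. It models the three pipeline registers named above as slots that each latch the previous slot's contents on every clock edge. This is only a toy model: the function name propagate and the signal string "RegWrite=1" are assumptions made for the illustration, not part of the textbook design.

```python
PIPELINE_REGISTERS = ["ID/EX", "EX/MEM", "MEM/WB"]


def propagate(value, clock_edges):
    """Track a value produced in ID as it moves through the pipeline registers.

    Returns a list with one snapshot of the register contents per clock edge.
    After the first edge, ID is assumed to be decoding other instructions,
    so nothing new is fed in (modeled as None).
    """
    regs = {name: None for name in PIPELINE_REGISTERS}
    incoming = value
    history = []
    for _ in range(clock_edges):
        # Update from the last register to the first so that each register
        # latches what its predecessor held before this clock edge.
        for later, earlier in zip(reversed(PIPELINE_REGISTERS),
                                  reversed(PIPELINE_REGISTERS[:-1])):
            regs[later] = regs[earlier]
        regs[PIPELINE_REGISTERS[0]] = incoming
        incoming = None
        history.append(dict(regs))
    return history


history = propagate("RegWrite=1", 3)
for edge, snapshot in enumerate(history, start=1):
    print(edge, snapshot)
```

After three clock edges the value sits in MEM/WB, where the WB stage can use it, and the earlier registers are free to carry signals for the instructions that follow.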

The analogy between a pipeline and an assembly line breaks down in one important respect. Putting together a door for a car does not depend on cars further along in the assembly line. But there are dependences between instructions. These can be seen in Patterson and Hennessy, Figure 4.51, where data is passed back from the WB stage to the ID stage and a new PC value is passed back from the MEM stage to the IF stage. These are referred to as, respectively, data dependences and control dependences. Both kinds of dependence are inherent in the instruction set. In both cases the execution of a later instruction depends on the results of an earlier instruction.
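
Data dependences of this kind can be found by scanning a program for registers that are read after an earlier instruction writes them. The Python sketch below does this for the six-instruction example used earlier. The Instr tuple, its field names, and the function raw_dependences are hypothetical helpers for this illustration, and the destination and source registers are written out by hand rather than parsed from the assembly text.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    text: str                 # assembly text, for display only
    dest: Optional[str]       # register written, if any
    sources: Tuple[str, ...]  # registers read


def raw_dependences(program):
    """Return (producer_index, consumer_index, register) triples.

    A triple is reported when an instruction reads a register whose most
    recent writer is an earlier instruction in the program.
    """
    last_writer = {}
    deps = []
    for i, instr in enumerate(program):
        for reg in instr.sources:
            if reg in last_writer:
                deps.append((last_writer[reg], i, reg))
        # Record the write only after checking the reads, so an instruction
        # such as "addi $t1, $t1, 4" does not depend on itself.
        if instr.dest is not None:
            last_writer[instr.dest] = i
    return deps


program = [
    Instr("l.s $f0, 0($t1)", "$f0", ("$t1",)),
    Instr("l.s $f2, 0($t2)", "$f2", ("$t2",)),
    Instr("mul.s $f4, $f0, $f2", "$f4", ("$f0", "$f2")),
    Instr("add.s $f6, $f6, $f4", "$f6", ("$f6", "$f4")),
    Instr("addi $t1, $t1, 4", "$t1", ("$t1",)),
    Instr("addi $t2, $t2, 4", "$t2", ("$t2",)),
]

deps = raw_dependences(program)
for producer, consumer, reg in deps:
    print(f"{program[consumer].text}  needs {reg} from  {program[producer].text}")
```

For this program the sketch reports that mul.s depends on both loads and that add.s depends on mul.s. Whether a dependence actually delays the pipeline depends on the implementation, which this sketch does not model.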

In addition, there are implementation hazards that depend on the starting point for the implementation. For example, if we started with a multicycle implementation, we would have problems in a pipeline because the ALU is used in more than one stage by the same instruction. When a branch instruction executes, the ALU is used to increment the PC, compute a branch target address, and compare two source operands. These uses prevent other instructions in the pipeline from using the ALU at the same time. This kind of obstacle is called a structural hazard.
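
One way to see the structural hazard is to list which hardware resources an instruction needs in each of its cycles. If a single resource appears in more than one cycle, then overlapping instructions that start one cycle apart will compete for it. The Python sketch below checks for this. The per-cycle resource lists are assumptions made for illustration (a rough model of a branch in a multicycle design and in a design with dedicated adders), not a description of any specific datapath figure.

```python
from collections import defaultdict


def shared_resource_conflicts(usage_by_cycle):
    """Find resources that one instruction uses in more than one cycle.

    usage_by_cycle[c] lists the resources the instruction needs in its
    c-th cycle. If instructions are issued one per clock, any resource
    needed in two different cycles will be wanted by two different
    instructions at the same time. Returns {resource: [cycle offsets]}.
    """
    offsets = defaultdict(list)
    for offset, resources in enumerate(usage_by_cycle):
        for resource in resources:
            offsets[resource].append(offset)
    return {r: cycles for r, cycles in offsets.items() if len(cycles) > 1}


# Assumed model of a branch in a multicycle design: one ALU does everything.
multicycle_branch = [
    ["memory", "ALU"],         # fetch; ALU increments the PC
    ["register file", "ALU"],  # decode; ALU computes the branch target
    ["ALU"],                   # ALU compares the two source operands
]

# Assumed model of a design with separate adders for the PC and the target.
dedicated_units_branch = [
    ["instruction memory", "PC adder"],
    ["register file"],
    ["ALU", "branch adder"],
]

print(shared_resource_conflicts(multicycle_branch))
print(shared_resource_conflicts(dedicated_units_branch))
```

In the first model the ALU shows up in three different cycles, so it is a source of structural hazards; in the second model no resource is reused, which is one reason a single-cycle datapath is the easier starting point.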

Pipelining is one of the primary reasons why RISC processors have a significant speed advantage over CISC processors. If arithmetic and logical instructions can access memory for source or destination operands, then it is much more difficult to break down instruction execution into stages with equal durations. If memory addressing modes are complex, this problem gets harder still. If instructions have varying lengths, it is more difficult to start a new instruction every cycle.

When pipelining is done with a CISC processor, it is done at a different level. The execution of instructions is broken down into smaller parts, which can then be pipelined. In effect, the CISC instructions are translated into a sequence of internal RISC instructions, which are then pipelined. This adds complexity to the processor and generally does not produce as much benefit. For upward compatibility, the Intel 80x86 family of processors, including Pentium processors since the early 1990s, has used this approach.