Loop unrolling is a compiler optimization applied to certain kinds of loops to reduce the frequency of branches and loop maintenance instructions. It is easily applied to sequential array processing loops where the number of iterations is known prior to execution of the loop.

This web presentation first looks at how loop unrolling works. Then it examines loop unrolling applied to three example loops:

All of these examples occur in various types of programs. The row operation loop is often a major component of scientific computations.

The general idea of loop unrolling is to replicate the code inside a loop body a number of times. The number of copies is called the loop unrolling factor. The number of iterations is divided by the loop unrolling factor.

Loop unrolling is most effective when computations involving the loop control variable can be simulated by the compiler. For example, when a loop is stepping sequentially through an array, increments to a register that points to array entries can be simulated by the compiler with changes in the displacements in load and store instructions.

Loop unrolling can reduce the number of loop maintenance instruction executions by the loop unrolling factor. In effect, the computations are done by the compiler rather than being done during program execution.

The loop unrolling factor does not have to exactly divide the number of iterations of the original loop. If the number of iterations is known at compile time then the compiler can add extra iterations after the unrolled loop. Otherwise it can just add a copy of the original loop.

The following C code will compute the sum of the entries in a 100-entry vector A.

    double arraySum = 0;
    for (int i = 0; i < 100; i++) {
      arraySum += A[i];
    }

MIPS assembly language code initializations

The code below omits the loop initializations:

MIPS assembly language code

    loop3:
            l.d     $f10, 0($5)       ; $f10 ← A[i]
            add.d   $f8, $f8, $f10    ; $f8 ← $f8 + A[i]
            addi    $5, $5, 8         ; increment pointer for A[i]
            addi    $7, $7, -1        ; decrement loop count
    test:
            bgtz    $7, loop3         ; Continue if loop count > 0

The addi instructions in this code are loop maintenance: advancing addresses and counting iterations.

This is the same code with loop unrolling with a loop unrolling factor of 4. Notice that the displacements in the loads are incremented by the compiler in the second, third, and fourth copies of the original loop body instructions. The immediate values in the addi instructions have been multiplied by 4 so that the effect of one iteration in the unrolled loop is the same as 4 iterations of the original loop.

Notice that branches and loop maintenance instructions have been reduced by a factor of 4.

    loop3:
            l.d     $f10, 0($5)         ; iteration with displacement 0
            add.d   $f8, $f8, $f10

            l.d     $f10, 8($5)         ; iteration with displacement 8
            add.d   $f8, $f8, $f10

            l.d     $f10, 16($5)        ; iteration with displacement 16
            add.d   $f8, $f8, $f10

            l.d     $f10, 24($5)        ; iteration with displacement 24
            add.d   $f8, $f8, $f10

            addi    $5, $5, 32
            addi    $7, $7, -4
    test:
            bgtz    $7, loop3           ; Continue loop if $7 > 0
      

The following C code will compute the dot product of two 100-entry vectors A and B.

    double dotProduct = 0;
    for (int i = 0; i < 100; i++) {
      dotProduct += A[i]*B[k];
    }

MIPS assembly language code initializations

The code below omits the loop initializations:

MIPS assembly language code

    loop3:
            l.d     $f10, 0($5)       ; $f10 ← A[i]
            l.d     $f12, 0($6)       ; $f12 ← B[i]
            mul.d   $f10, $f10, $f12  ; $f10 ← A[i]*B[i]
            add.d   $f8, $f8, $f10    ; $f8 ← $f8 + A[i]*B[ki]
            addi    $5, $5, 8         ; increment pointer for A[i]
            addi    $6, $6, 8         ; increment pointer for B[i]
            addi    $7, $7, -1        ; decrement loop count
    test:
            bgtz    $7, loop3         ; Continue if loop count > 0

The addi instructions in this code are loop maintenance: advancing addresses and counting iterations.

This is the same code with loop unrolling with a loop unrolling factor of 4.

    loop3:
            l.d     $f10, 0($5)         ; iteration with displacement 0
            l.d     $f12, 0($6)
            mul.d   $f10, $f10, $f12
            add.d   $f8, $f8, $f10

            l.d     $f10, 8($5)         ; iteration with displacement 8
            l.d     $f12, 8($6)
            mul.d   $f10, $f10, $f12
            add.d   $f8, $f8, $f10

            l.d     $f10, 16($5)        ; iteration with displacement 16
            l.d     $f12, 16($6)
            mul.d   $f10, $f10, $f12
            add.d   $f8, $f8, $f10

            l.d     $f10, 24($5)        ; iteration with displacement 24
            l.d     $f12, 24($6)
            mul.d   $f10, $f10, $f12
            add.d   $f8, $f8, $f10

            addi    $5, $5, 32
            addi    $6, $6, 32
            addi    $7, $7, -4
    test:
            bgtz    $7, loop3           ; Continue loop if $7 > 0
      

The following C code will compute the dot product of two 100-entry vectors A and B.

    double dotProduct = 0;
    for (int i = 0; i < 100; i++) {
      dotProduct += A[i]*B[k];
    }

MIPS assembly language code initializations

The code below omits the loop initializations:

MIPS assembly language code

    loop3:
            l.d     $f10, 0($5)       ; $f10 ← A[i]
            mul.d   $f10, $f10, $f0   ; $f10 ← A[i]*C
            l.d     $f12, 0($6)       ; $f12 ← B[i]
            add.d   $f12, $f12, $f10  ; $f12 ← $f12 + A[i]*C
            s.d     $f12, 0($6)       ; B[i] ← B[i] + A[i]*C
            addi    $5, $5, 8         ; increment pointer for A[i]
            addi    $6, $6, 8         ; increment pointer for B[i]
            addi    $7, $7, -1        ; decrement loop count
    test:
            bgtz    $7, loop3         ; Continue if loop count > 0

This is the same code with loop unrolling with a loop unrolling factor of 4.

loop3:
            l.d     $f10, 0($5)         ; iteration with displacement 0
            mul.d   $f10, $f10, $f12
            l.d     $f12, 0($6)
            add.d   $f12, $f12, $f10
            s.d     $f12, 0($6)

            l.d     $f10, 8($5)         ; iteration with displacement 8
            mul.d   $f10, $f10, $f12
            l.d     $f12, 8($6)
            add.d   $f12, $f12, $f10
            s.d     $f12, 8($6)

            l.d     $f10, 16($5)        ; iteration with displacement 16
            mul.d   $f10, $f10, $f12
            l.d     $f12, 16($6)
            add.d   $f12, $f12, $f10
            s.d     $f12, 16($6)

            l.d     $f10, 24($5)        ; iteration with displacement 24
            mul.d   $f10, $f10, $f12
            l.d     $f12, 24($6)
            add.d   $f12, $f12, $f10
            s.d     $f12, 24($6)

            addi    $5, $5, 32
            addi    $6, $6, 32
            addi    $7, $7, -4
    test:
            bgtz    $7, loop3           ; Continue loop if $7 > 0