Advantages
The overhead in "tight" loops often consists of instructions to increment a pointer or index to the next element in an array ( pointer arithmetic), as well as "end of loop" tests. If an optimizing compiler or assembler is able to pre-calculate offsets to each ''individually referenced'' array variable, these can be built into theDisadvantages
* Increased program code size, which can be undesirable, particularly for embedded applications. Can also cause an increase in instruction cache misses, which may adversely affect performance. * Unless performed transparently by an optimizing compiler, the code may become less readable. * If the code in the body of the loop involves function calls, it may not be possible to combine unrolling withStatic/manual loop unrolling
Manual (or static) loop unrolling involves the programmer analyzing the loop and interpreting the iterations into a sequence of instructions which will reduce the loop overhead. This is in contrast to dynamic unrolling which is accomplished by the compiler.Simple manual example in C
A procedure in a computer program is to delete 100 items from a collection. This is normally accomplished by means of a''for''
-loop which calls the function ''delete(item_number)''. If this part of the program is to be optimized, and the overhead of the loop requires significant resources compared to those for the ''delete(x)'' function, unwinding can be used to speed it up.
As a result of this modification, the new program has to make only 20 iterations, instead of 100. Afterwards, only 20% of the jumps and conditional branches need to be taken, and represents, over many iterations, a potentially significant decrease in the loop administration overhead. To produce the optimal benefit, no variables should be specified in the unrolled code that require pointer arithmetic. This usually requires " base plus offset" addressing, rather than indexed referencing.
On the other hand, this ''manual'' loop unrolling expands the source code size from 3 lines to 7, that have to be produced, checked, and debugged, and the compiler may have to allocate more registers to store variables in the expanded loop iteration . In addition, the loop control variables and number of operations inside the unrolled loop structure have to be chosen carefully so that the result is indeed the same as in the original code (assuming this is a later optimization on already working code). For example, consider the implications if the iteration count were not divisible by 5. The manual amendments required also become somewhat more complicated if the test conditions are variables. See also Duff's device.
Early complexity
In the simple case, the loop control is merely an administrative overhead that arranges the productive statements. The loop itself contributes nothing to the results desired, merely saving the programmer the tedium of replicating the code a hundred times which could have been done by a pre-processor generating the replications, or a text editor. Similarly,if
-statements and other flow control statements could be replaced by code replication, except that code bloat can be the result. Computer programs easily track the combinations, but programmers find this repetition boring and make mistakes.
Consider:
But of course, the code performed need not be the invocation of a procedure, and this next example involves the index variable in computation:
which, if compiled, might produce a lot of code (''print'' statements being notorious) but further optimization is possible. This example makes reference only to ''x(i)'' and ''x(i - 1)'' in the loop (the latter only to develop the new value ''x(i)'') therefore, given that there is no later reference to the array ''x'' developed here, its usages could be replaced by a simple variable. Such a change would however mean a simple variable ''whose value is changed'' whereas if staying with the array, the compiler's analysis might note that the array's values are constant, each derived from a previous constant, and therefore carries forward the constant values so that the code becomes
print 2, 2;
print 3, 6;
print 4, 24;
...etc.
In general, the content of a loop might be large, involving intricate array indexing. These cases are probably best left to optimizing compilers to unroll. Replicating innermost loops might allow many possible optimisations yet yield only a small gain unless n is large.
Unrolling WHILE loops
Consider a pseudocode WHILE loop similar to the following: In this case, unrolling is faster because the ENDWHILE (a jump to the start of the loop) will be executed 66% less often. Even better, the "tweaked" pseudocode example, that may be performed automatically by some optimizing compilers, eliminating unconditional jumps altogether.Dynamic unrolling
Since the benefits of loop unrolling are frequently dependent on the size of an array—which may often not be known until run time— JIT compilers (for example) can determine whether to invoke a "standard" loop sequence or instead generate a (relatively short) sequence of individual instructions for each element. This flexibility is one of the advantages of just-in-time techniques versus static or manual optimization in the context of loop unrolling. In this situation, it is often with relatively small values of ''n'' where the savings are still useful—requiring quite small (if any) overall increase in program size (that might be included just once, as part of a standard library). Assembly language programmers (including optimizing compiler writers) are also able to benefit from the technique of dynamic loop unrolling, using a method similar to that used for efficient branch tables. Here, the advantage is greatest where the maximum offset of any referenced field in a particular array is less than the maximum offset that can be specified in a machine instruction (which will be flagged by the assembler if exceeded).Assembler example (IBM/360 or Z/Architecture)
This example is for IBM/360 or Z/Architecture assemblers and assumes a field of 100 bytes (at offset zero) is to be copied from array ''FROM'' to array ''TO''—both having 50 entries with element lengths of 256 bytes each.XC xx*256+100(156,R1),xx*256+100(R2)
, can be added immediately after every MVC in the sequence (where xx
matches the value in the MVC above it).
It is, of course, perfectly possible to generate the above code "inline" using a single assembler C example
The following example demonstrates dynamic loop unrolling for a simple program written in C. Unlike the assembler example above, pointer/index arithmetic is still generated by the compiler in this example because a variable (i) is still used to address the array element. Full optimization is only possible if absolute indexes are used in the replacement statements.C to MIPS assembly language loop unrolling example
The following example will compute adouble
. Here is the code in C:Converting to MIPS assembly language
The following is MIPS assembly code that will compute the dot product of two 100-entry vectors, A and B, before implementing loop unrolling. The code below omits the loop initializations: * Initialize loop count ($7) to 100. * Initialize dot product ($f10) to 0. * InitializeA /code> pointer ($5) to the base address of A
.
* Initialize B /code> pointer ($6) to the base address of B
.
Note that the size of one element of the arrays (a double
) is 8 bytes.
loop3:
l.d $f10, 0($5) ; $f10 ← A l.d $f12, 0($6) ; $f12 ← B mul.d $f10, $f10, $f12 ; $f10 ← A B add.d $f8, $f8, $f10 ; $f8 ← $f8 + A B addi $5, $5, 8 ; increment pointer for A by the size
; of a double.
addi $6, $6, 8 ; increment pointer for B by the size
; of a double.
addi $7, $7, -1 ; decrement loop count
test:
bgtz $7, loop3 ; Continue if loop count > 0
Unrolling the Loop in MIPS
The following is the same as above, but with loop unrolling implemented at a factor of 4. Note again that the size of one element of the arrays (a double
) is 8 bytes; thus the 0, 8, 16, 24 displacements and the 32 displacement on each loop.
loop3:
l.d $f10, 0($5) ; iteration with displacement 0
l.d $f12, 0($6)
mul.d $f10, $f10, $f12
add.d $f8, $f8, $f10
l.d $f10, 8($5) ; iteration with displacement 8
l.d $f12, 8($6)
mul.d $f10, $f10, $f12
add.d $f8, $f8, $f10
l.d $f10, 16($5) ; iteration with displacement 16
l.d $f12, 16($6)
mul.d $f10, $f10, $f12
add.d $f8, $f8, $f10
l.d $f10, 24($5) ; iteration with displacement 24
l.d $f12, 24($6)
mul.d $f10, $f10, $f12
add.d $f8, $f8, $f10
addi $5, $5, 32
addi $6, $6, 32
addi $7, $7, -4
test:
bgtz $7, loop3 ; Continue loop if $7 > 0
See also
* Don't repeat yourself
"Don't repeat yourself" (DRY) is a principle of software development aimed at reducing repetition of software patterns, replacing it with abstractions or using data normalization to avoid redundancy.
The DRY principle is stated as "Every piece o ...
* Duff's device
* Instruction level parallelism
* Just-in-time compilation
In computing, just-in-time (JIT) compilation (also dynamic translation or run-time compilations) is a way of executing computer code that involves compiler, compilation during execution of a program (at run time (program lifecycle phase), run tim ...
* Loop fusion
* Loop splitting
* Parallel computing
References
Further reading
*
External links
Chapter 7, pages 8 to 10
of Michael Abrash's ''Graphics Programming Black Book'' is about loop unrolling, with an example in x86 assembly.
Generalized Loop Unrolling
gives a concise introduction.
Optimizing subroutines in assembly language
Agner Fog's optimizations handbook with the ''loop unrolling'' technique (2012).
{{Compiler optimizations
Compiler optimizations
Articles with example C code
Parallel computing