In computer science, and particularly in compiler design, loop nest optimization (LNO) is an optimization technique that applies a set of loop transformations to a loop nest for the purpose of locality optimization, parallelization, or other reduction of loop overhead. (Nested loops occur when one loop is placed inside another loop.) One classical usage is to reduce the memory access latency, or the cache bandwidth required, of some common linear algebra algorithms by improving cache reuse.
The technique used to produce this optimization is called loop tiling, also known as loop blocking or strip mine and interchange.
Overview
Loop tiling partitions a loop's iteration space into smaller chunks or blocks, so as to help ensure data used in a loop stays in the
cache until it is reused. The partitioning of the loop iteration space leads to the partitioning of large arrays into smaller blocks, thus fitting the accessed array elements into the cache, enhancing cache reuse, and reducing the cache size requirement.
An ordinary loop
for (i = 0; i < N; ++i)
    ...
can be blocked with a block size B by replacing it with
for (j = 0; j < N; j += B)
    for (i = j; i < min(N, j + B); ++i)
        ...
where min() is a function returning the minimum of its arguments.
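For instance, a minimal compilable sketch (with hypothetical values N = 10 and B = 4, not taken from the original example) shows that the blocked form visits exactly the same iterations as the original loop:

#include <stdio.h>

#define N 10   /* hypothetical trip count */
#define B 4    /* hypothetical block size */

static int min(int a, int b) { return a < b ? a : b; }

int main(void) {
    int sum = 0;
    /* Blocked form of "for (i = 0; i < N; i++) sum += i;":
       the outer loop walks block starts; the inner loop covers
       one block, clipped to N by min(). */
    for (int j = 0; j < N; j += B)
        for (int i = j; i < min(N, j + B); i++)
            sum += i;
    printf("%d\n", sum);   /* prints 45, same as the unblocked loop */
    return 0;
}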
Example: matrix-vector multiplication
The following is an example of matrix-vector multiplication. The code operates on a 100×100 matrix a and 100-element vectors b and c; it does not partition the arrays into smaller blocks.
int i, j, a[100][100], b[100], c[100];
int n = 100;
for (i = 0; i < n; i++) {
    c[i] = 0;
    for (j = 0; j < n; j++) {
        c[i] = c[i] + a[i][j] * b[j];
    }
}
After loop tiling is applied using 2 * 2 blocks, the code looks like:
int i, j, x, y, a[100][100], b[100], c[100];
int n = 100;
for (i = 0; i < n; i += 2) {
    c[i] = 0;
    c[i + 1] = 0;
    for (j = 0; j < n; j += 2) {
        for (x = i; x < min(i + 2, n); x++) {
            for (y = j; y < min(j + 2, n); y++) {
                c[x] = c[x] + a[x][y] * b[y];
            }
        }
    }
}
The original loop iteration space is ''n'' by ''n''. The accessed chunk of array a[i, j] is also ''n'' by ''n''. When ''n'' is too large and the cache size of the machine is too small, the accessed array elements in one loop iteration (for example, i = 1, j = 1 to n) may cross cache lines, causing cache misses.
Tiling size
It is not always easy to decide what value of tiling size is optimal for one loop, because it demands an accurate estimate of the array regions accessed in the loop and the cache size of the target machine. The order of the loop nests (loop interchange) also plays an important role in achieving better cache performance. Explicit blocking requires choosing a tile size based on these factors. By contrast, cache-oblivious algorithms are designed to make efficient use of the cache without explicit blocking.
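As a rough illustration (a hypothetical heuristic, not from the original text), one common back-of-the-envelope approach for matrix multiplication sizes a square tile so that three tiles, one each for A, B, and C, fit in the cache at once:

#include <math.h>
#include <stddef.h>

/* Hypothetical helper: pick a square tile edge T such that three
   T x T double-precision tiles fit in a cache of the given size.
   Real compilers and libraries refine this further using the line
   size, associativity, and TLB reach of the target machine. */
static int pick_tile(size_t cache_bytes) {
    return (int)sqrt((double)cache_bytes / (3.0 * sizeof(double)));
}

For a 16KB data cache this yields a tile edge of about 26, the same order of magnitude as the ib values chosen later in this article.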
Example: matrix multiplication
Many large mathematical operations on computers end up spending much of their time doing matrix multiplication. The operation is:
: C = A×B
where A, B, and C are N×N arrays. Subscripts, for the following description, are of the form C[row][column].
The basic loop is:
int i, j, k;
for (i = 0; i < N; ++i) {
    for (j = 0; j < N; ++j) {
        C[i][j] = 0;
        for (k = 0; k < N; ++k) {
            C[i][j] += A[i][k] * B[k][j];
        }
    }
}
There are three problems to solve:
* Floating-point additions take some number of cycles to complete. To keep an adder with a multiple-cycle latency busy, the code must update multiple accumulators in parallel.
* Machines can typically do just one memory operation per multiply-add, so values loaded must be reused at least twice.
* Typical PC memory systems can only sustain one 8-byte doubleword per 10–30 double-precision multiply–adds, so values loaded into the cache must be reused many times.
The original loop calculates the result for one entry in the result matrix at a time. By calculating a small block of entries simultaneously, the following loop reuses each loaded value twice, so that the inner loop has four loads and four multiply–adds, thus solving problem #2. By carrying four accumulators simultaneously, this code can keep a single floating point adder with a latency of 4 busy nearly all the time (problem #1). However, the code does not address the third problem. (Nor does it address the cleanup work necessary when N is odd. Such details will be left out of the following discussion.)
for (i = 0; i < N; i += 2) {
    for (j = 0; j < N; j += 2) {
        double acc00, acc01, acc10, acc11;
        acc00 = acc01 = acc10 = acc11 = 0;
        for (k = 0; k < N; k++) {
            acc00 += B[k][j + 0] * A[i + 0][k];
            acc01 += B[k][j + 1] * A[i + 0][k];
            acc10 += B[k][j + 0] * A[i + 1][k];
            acc11 += B[k][j + 1] * A[i + 1][k];
        }
        C[i + 0][j + 0] = acc00;
        C[i + 0][j + 1] = acc01;
        C[i + 1][j + 0] = acc10;
        C[i + 1][j + 1] = acc11;
    }
}
This code has had both the i and j iterations blocked by a factor of two, and had both of the resulting two-iteration inner loops completely unrolled.
This code would run quite acceptably on a Cray Y-MP (built in 1988), which can sustain 0.8 multiply–adds per memory operation to main memory. A machine like a 2.8 GHz Pentium 4, built in 2003, has slightly less memory bandwidth and vastly better floating point, so that it can sustain 16.5 multiply–adds per memory operation. As a result, the code above will run more slowly on the 2.8 GHz Pentium 4 than on the 166 MHz Y-MP!
A machine with a longer floating-point add latency or with multiple adders would require more accumulators to run in parallel. It is easy to change the loop above to compute a 3x3 block instead of a 2x2 block, but the resulting code is not always faster. The loop requires registers to hold both the accumulators and the loaded and reused A and B values. A 2x2 block requires 7 registers. A 3x3 block requires 13, which will not work on a machine with just 8 floating-point registers in the ISA. If the CPU does not have enough registers, the compiler will schedule extra loads and stores to spill the registers into stack slots, which will make the loop run slower than a smaller blocked loop.
Matrix multiplication is like many other codes in that it can be limited by memory bandwidth, and more registers can help the compiler and programmer reduce the need for memory bandwidth. This ''register pressure'' is why vendors of RISC CPUs, who intended to build machines more parallel than the general-purpose x86 and 68000 CPUs, adopted 32-entry floating-point register files.
The code above does not use the cache very well. During the calculation of a horizontal stripe of C results, one horizontal stripe of A is loaded and the entire matrix B is loaded. For the entire calculation, C is stored once (that's good), A is loaded into the cache once (assuming a stripe of A fits in the cache alongside a stripe of B), but B is loaded N/ib times, where ib is the height of the stripe in the C matrix, for a total of N³/ib doubleword loads from main memory. In the code above, ''ib'' is 2.
The next step in reducing the memory traffic is to make ib as large as possible. It needs to be larger than the "balance" number reported by the STREAM benchmark. In the case of one particular 2.8 GHz Pentium 4 system used for this example, the balance number is 16.5. The second code example above cannot be extended directly, since that would require many more accumulator registers. Instead, the loop is ''blocked'' over i. (Technically, this is the second time i is blocked, as the first time was the factor of 2.)
for (ii = 0; ii < N; ii += ib) {
    for (j = 0; j < N; j += 2) {
        for (i = ii; i < ii + ib; i += 2) {
            double acc00, acc01, acc10, acc11;
            acc00 = acc01 = acc10 = acc11 = 0;
            for (k = 0; k < N; k++) {
                acc00 += B[k][j + 0] * A[i + 0][k];
                acc01 += B[k][j + 1] * A[i + 0][k];
                acc10 += B[k][j + 0] * A[i + 1][k];
                acc11 += B[k][j + 1] * A[i + 1][k];
            }
            C[i + 0][j + 0] = acc00;
            C[i + 0][j + 1] = acc01;
            C[i + 1][j + 0] = acc10;
            C[i + 1][j + 1] = acc11;
        }
    }
}
With this code, ib can be set to any desired parameter, and the number of loads of the B matrix will be reduced by that factor. This freedom has a cost: an ib×N slice of the A matrix is now kept in the cache. As long as that fits, this code will not be limited by the memory system.
So what size matrix fits? The example system, a 2.8 GHz Pentium 4, has a 16KB primary data cache. With ib=20, the slice of the A matrix in this code will be larger than the primary cache when N > 100. For problems larger than that, another trick is needed.
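(To see why: the slice occupies ib × N × 8 bytes, so with ib = 20 and N = 100 it takes 20 × 100 × 8 = 16,000 bytes, which already nearly fills the 16,384-byte cache; any larger N spills.)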
That trick is reducing the size of the stripe of the B matrix by blocking the k loop as well, so that the stripe is of size ib × kb. Blocking the k loop means that the C array will be loaded and stored N/kb times, for a total of 2*N³/kb memory transfers. B is still transferred N/ib times, for N³/ib transfers. So long as
: 2*N³/kb + N³/ib < N³/balance
the machine's memory system will keep up with the floating-point unit and the code will run at maximum performance. The 16KB cache of the Pentium 4 is not quite big enough: if ib=24 and kb=64 are chosen, only 12KB of the cache is used, which avoids filling it completely and is desirable because the C and B arrays need some room to flow through. These numbers come within 20% of the peak floating-point speed of the processor.
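Checking the inequality with these values (dividing through by N³):
: 2/kb + 1/ib = 2/64 + 1/24 ≈ 0.073, while 1/balance = 1/16.5 ≈ 0.061
The memory traffic therefore exceeds the balance point by roughly 20%, consistent with running at about 80% of peak.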
Here is the code with loop k blocked.
for (ii = 0; ii < N; ii += ib) {
    for (kk = 0; kk < N; kk += kb) {
        for (j = 0; j < N; j += 2) {
            for (i = ii; i < ii + ib; i += 2) {
                double acc00, acc01, acc10, acc11;
                if (kk == 0) {
                    acc00 = acc01 = acc10 = acc11 = 0;
                } else {
                    acc00 = C[i + 0][j + 0];
                    acc01 = C[i + 0][j + 1];
                    acc10 = C[i + 1][j + 0];
                    acc11 = C[i + 1][j + 1];
                }
                for (k = kk; k < kk + kb; k++) {
                    acc00 += B[k][j + 0] * A[i + 0][k];
                    acc01 += B[k][j + 1] * A[i + 0][k];
                    acc10 += B[k][j + 0] * A[i + 1][k];
                    acc11 += B[k][j + 1] * A[i + 1][k];
                }
                C[i + 0][j + 0] = acc00;
                C[i + 0][j + 1] = acc01;
                C[i + 1][j + 0] = acc10;
                C[i + 1][j + 1] = acc11;
            }
        }
    }
}
The above code examples do not show the details of dealing with values of N which are not multiples of the blocking factors. Compilers which do loop nest optimization emit code to clean up the edges of the computation. For example, most LNO compilers would probably split the kk == 0 iteration off from the rest of the kk iterations, to remove the if statement from the i loop. This is one of the values of such a compiler: while it is straightforward to code the simple cases of this optimization, keeping all the details correct as the code is replicated and transformed is an error-prone process.
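A sketch of that split (illustrative only, not the output of any particular compiler) peels the first k block so that the steady-state inner loops carry no test:

/* Peel of the kk == 0 block (assumes N is a multiple of kb, ib, and 2,
   as in the examples above). The first k block starts the accumulators
   at zero; later blocks resume from the partial sums already in C. */
for (ii = 0; ii < N; ii += ib) {
    for (j = 0; j < N; j += 2) {             /* kk == 0: start from zero */
        for (i = ii; i < ii + ib; i += 2) {
            double acc00 = 0, acc01 = 0, acc10 = 0, acc11 = 0;
            for (k = 0; k < kb; k++) {
                acc00 += B[k][j + 0] * A[i + 0][k];
                acc01 += B[k][j + 1] * A[i + 0][k];
                acc10 += B[k][j + 0] * A[i + 1][k];
                acc11 += B[k][j + 1] * A[i + 1][k];
            }
            C[i + 0][j + 0] = acc00;
            C[i + 0][j + 1] = acc01;
            C[i + 1][j + 0] = acc10;
            C[i + 1][j + 1] = acc11;
        }
    }
    for (kk = kb; kk < N; kk += kb) {        /* later blocks: resume from C */
        for (j = 0; j < N; j += 2) {
            for (i = ii; i < ii + ib; i += 2) {
                double acc00 = C[i + 0][j + 0];
                double acc01 = C[i + 0][j + 1];
                double acc10 = C[i + 1][j + 0];
                double acc11 = C[i + 1][j + 1];
                for (k = kk; k < kk + kb; k++) {
                    acc00 += B[k][j + 0] * A[i + 0][k];
                    acc01 += B[k][j + 1] * A[i + 0][k];
                    acc10 += B[k][j + 0] * A[i + 1][k];
                    acc11 += B[k][j + 1] * A[i + 1][k];
                }
                C[i + 0][j + 0] = acc00;
                C[i + 0][j + 1] = acc01;
                C[i + 1][j + 0] = acc10;
                C[i + 1][j + 1] = acc11;
            }
        }
    }
}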
The above loop will only achieve 80% of peak flops on the example system when blocked for the 16KB L1 cache size. It will do worse on systems with even more unbalanced memory systems. Fortunately, the Pentium 4 has 256KB (or more, depending on the model) of high-bandwidth level-2 cache as well as the level-1 cache. There is a choice:
* Adjust the block sizes for the level-2 cache. This will stress the processor's ability to keep many instructions in flight simultaneously, and there is a good chance it will be unable to achieve full bandwidth from the level-2 cache.
* Block the loops again, this time for the level-2 cache size. With a total of three levels of blocking (for the register file, for the L1 cache, and for the L2 cache), the code will minimize the required bandwidth at each level of the memory hierarchy. Unfortunately, the extra levels of blocking will incur still more loop overhead, which for some problem sizes on some hardware may be more time-consuming than any shortcomings in the hardware's ability to stream data from the L2 cache; a sketch of the doubly blocked structure is shown below.
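A sketch of the doubly blocked structure (the L2 tile names ib2 and kb2 are illustrative; the register-level 2x2 blocking is omitted for readability, and N is assumed to be a multiple of all blocking factors):

/* Two cache levels of blocking: ib2/kb2 are L2-sized tiles and
   ib/kb are L1-sized tiles. Each C[i][j] still accumulates over
   every k exactly once, in ascending kk order. */
for (ii2 = 0; ii2 < N; ii2 += ib2) {
    for (kk2 = 0; kk2 < N; kk2 += kb2) {
        for (ii = ii2; ii < ii2 + ib2; ii += ib) {
            for (kk = kk2; kk < kk2 + kb2; kk += kb) {
                for (j = 0; j < N; j++) {
                    for (i = ii; i < ii + ib; i++) {
                        double acc = (kk == 0) ? 0.0 : C[i][j];
                        for (k = kk; k < kk + kb; k++) {
                            acc += A[i][k] * B[k][j];
                        }
                        C[i][j] = acc;
                    }
                }
            }
        }
    }
}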
Rather than specifically tuning for one particular cache size, as in the first example, a cache-oblivious algorithm is designed to take advantage of any available cache, no matter what its size is. This automatically takes advantage of two or more levels of memory hierarchy, if available. Cache-oblivious algorithms for matrix multiplication are known.
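A minimal sketch of such a divide-and-conquer multiply (an illustration, not a specific published algorithm; it assumes C starts zeroed, a shared row stride ld, and an arbitrary base-case threshold):

/* Cache-oblivious matrix multiply sketch: C += A*B, where A is m x p,
   B is p x n, and ld is the row stride of the underlying arrays.
   Recursively splitting the largest dimension means that at some
   recursion depth the working set fits each level of cache, without
   the cache size appearing anywhere in the code. */
static void matmul_rec(int m, int n, int p,
                       const double *A, const double *B, double *C, int ld) {
    if (m + n + p <= 48) {                    /* arbitrary small base case */
        for (int i = 0; i < m; i++)
            for (int k = 0; k < p; k++)
                for (int j = 0; j < n; j++)
                    C[i * ld + j] += A[i * ld + k] * B[k * ld + j];
    } else if (m >= n && m >= p) {            /* split the rows of A and C */
        matmul_rec(m / 2, n, p, A, B, C, ld);
        matmul_rec(m - m / 2, n, p, A + (m / 2) * ld, B, C + (m / 2) * ld, ld);
    } else if (n >= p) {                      /* split the columns of B and C */
        matmul_rec(m, n / 2, p, A, B, C, ld);
        matmul_rec(m, n - n / 2, p, A, B + n / 2, C + n / 2, ld);
    } else {                                  /* split the shared k dimension */
        matmul_rec(m, n, p / 2, A, B, C, ld);
        matmul_rec(m, n, p - p / 2, A + p / 2, B + (p / 2) * ld, C, ld);
    }
}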
See also
* Duff's device
* Loop optimization
Further reading
# Wolfe, M. ''More Iteration Space Tiling''. Supercomputing '89, pages 655–664, 1989.
# Wolf, M. E. and Lam, M. ''A Data Locality Optimizing Algorithm''. PLDI '91, pages 30–44, 1991.
# Irigoin, F. and Triolet, R. ''Supernode Partitioning''. POPL '88, pages 319–329, 1988.
# Xue, J. ''Loop Tiling for Parallelism''. Kluwer Academic Publishers, 2000.
# Lam, M. S., Rothberg, E. E., and Wolf, M. E. ''The Cache Performance and Optimizations of Blocked Algorithms''. In Proceedings of the 4th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 63–74, April 1991.
External links
* STREAM benchmark results, showing the overall balance between floating point operations and memory operations for many different computers
* "CHiLL: Composable High-Level Loop Transformation Framework"