In computing, a cache control instruction is a hint embedded in the instruction stream of a processor, intended to improve the performance of hardware caches using foreknowledge of the memory access pattern supplied by the programmer or compiler. Such instructions may reduce cache pollution, lower bandwidth requirements, and hide memory latency by providing better control over the working set. Most cache control instructions do not affect the semantics of a program, although some can.
Examples
Such instructions, with variants, are supported by several processor instruction set architectures, including ARM, MIPS, PowerPC, and x86.
Prefetch
Also termed ''data cache block touch'', a prefetch requests loading the cache line associated with a given address. This is performed by the PREFETCH instruction in the x86 instruction set. Some variants bypass higher levels of the cache hierarchy, which is useful in a 'streaming' context for data that is traversed once rather than held in the working set. The prefetch should be issued sufficiently far ahead in time to hide the latency of memory access, for example in a loop traversing memory linearly. The GNU Compiler Collection intrinsic function __builtin_prefetch can be used to invoke this in the programming languages C or C++.
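As a minimal sketch of the linear-traversal case described above, the following C function sums an array while hinting a fixed distance ahead with __builtin_prefetch. The function name sum_with_prefetch and the constant PREFETCH_DISTANCE are hypothetical; a suitable distance depends on the processor and the memory latency, and would normally be found by measurement.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tuning constant: how many elements ahead to prefetch.
 * The right value is machine-dependent and is an assumption here. */
enum { PREFETCH_DISTANCE = 64 };

/* Sum an array, hinting the cache about data needed in later iterations. */
long sum_with_prefetch(const int *data, size_t n)
{
    assert(data != NULL || n == 0);
    long total = 0;
    for (size_t i = 0; i < n; i++) {
#if defined(__GNUC__)
        /* Arguments: address, rw (0 = read), locality (0 = no temporal
         * locality, i.e. data that is traversed once). */
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&data[i + PREFETCH_DISTANCE], 0, 0);
#endif
        total += data[i];
    }
    return total;
}
```

The hint does not change the result; on compilers without the intrinsic, the loop simply runs without it. On processors with automatic hardware prefetch, a simple linear loop like this one may see little or no benefit.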
Instruction prefetch
A variant of prefetch for the instruction cache.
Data cache block allocate zero
This hint is used to prepare cache lines before overwriting the contents completely. In this case, the CPU needn't load anything from main memory. The semantic effect is equivalent to an aligned memset of a cache-line sized block to zero, but because the read from main memory is avoided, the operation is effectively free.
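The semantic equivalence stated above can be written out in portable C. This is only a stand-in for the effect, not the instruction itself: the helper name zero_cache_line is hypothetical, and the 64-byte line size is an assumption, since real line sizes vary between processors.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed cache line size in bytes; real hardware varies. */
#define CACHE_LINE_SIZE 64u

/* Portable stand-in for the semantic effect of an "allocate zero" hint:
 * zero exactly one cache-line sized, cache-line aligned block.
 * A real instruction would do this without first reading the line
 * from main memory; this memset gives no such guarantee. */
void zero_cache_line(void *line)
{
    assert(line != NULL);
    assert((uintptr_t)line % CACHE_LINE_SIZE == 0); /* must be aligned */
    memset(line, 0, CACHE_LINE_SIZE);
}
```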
Data cache block invalidate
This hint is used to discard cache lines without committing their contents to main memory. Care is needed, since incorrect results are possible: unlike other cache hints, it significantly modifies the semantics of the program. It is used in conjunction with allocate zero for managing temporary data, which avoids unneeded main memory traffic and cache pollution.
Data cache block flush
This hint requests the immediate eviction of a cache line, making way for future allocations. It is used when it is known that data is no longer part of the working set.
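As one possible illustration on x86, the sketch below flushes every cache line covering a buffer using the SSE2 intrinsic _mm_clflush. The helper name flush_buffer and the 64-byte line size are assumptions; on targets without SSE2 the function compiles to a no-op. On x86, the flush writes modified data back to memory before evicting the line, so the buffer contents are preserved.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h> /* _mm_clflush, _mm_mfence */
#endif

/* Assumed cache line size in bytes; real hardware varies. */
#define FLUSH_LINE_SIZE 64u

/* Hypothetical helper: evict every cache line that overlaps [buf, buf+len). */
void flush_buffer(const void *buf, size_t len)
{
    assert(buf != NULL || len == 0);
#if defined(__SSE2__)
    /* Round the start down to a line boundary so the last partial
     * line of an unaligned buffer is not missed. */
    uintptr_t start = (uintptr_t)buf & ~(uintptr_t)(FLUSH_LINE_SIZE - 1);
    uintptr_t end = (uintptr_t)buf + len;
    for (uintptr_t a = start; a < end; a += FLUSH_LINE_SIZE)
        _mm_clflush((const void *)a);
    _mm_mfence(); /* order the flushes with later memory accesses */
#else
    (void)buf;
    (void)len; /* no portable equivalent; treated as a no-op here */
#endif
}
```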
Other hints
Some processors support variants of load–store instructions that also imply cache hints. An example is load last in the PowerPC instruction set, which suggests that data will only be used once; i.e., the cache line in question may be pushed to the head of the eviction queue, whilst remaining available if it is still directly needed.
Alternatives
Automatic prefetch
In recent times, cache control instructions have become less popular as increasingly advanced application processor designs from Intel and ARM devote more transistors to accelerating code written in traditional languages, e.g., performing automatic prefetch, with hardware to detect linear access patterns on the fly. However, these techniques may remain valid for throughput-oriented processors, which have a different throughput vs latency tradeoff and may prefer to devote more area to execution units.
Scratchpad memory
Some processors support scratchpad memory into which temporaries may be put, and direct memory access (DMA) to transfer data to and from main memory when needed. This approach is used by the Cell processor and some embedded systems. It allows greater control over memory traffic and locality (as the working set is managed by explicit transfers), and eliminates the need for expensive cache coherency in a manycore machine.
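The explicit-transfer style described above can be sketched in plain C as a simulation. Here a small local array stands in for the scratchpad and memcpy stands in for a DMA transfer; the names dma_copy, scale_via_scratchpad and TILE_ELEMS are hypothetical, and a real platform would use its own DMA interface, usually asynchronously with double buffering.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical scratchpad capacity, in elements. */
enum { TILE_ELEMS = 256 };

/* Stand-in for a DMA transfer; real hardware would copy asynchronously. */
static void dma_copy(void *dst, const void *src, size_t bytes)
{
    memcpy(dst, src, bytes);
}

/* Scale an array in main memory by staging one tile at a time
 * through a small local buffer that plays the role of the scratchpad. */
void scale_via_scratchpad(float *data, size_t n, float factor)
{
    assert(data != NULL || n == 0);
    float scratch[TILE_ELEMS];
    for (size_t base = 0; base < n; base += TILE_ELEMS) {
        size_t count = (n - base < TILE_ELEMS) ? n - base : TILE_ELEMS;
        dma_copy(scratch, data + base, count * sizeof(float)); /* transfer in */
        for (size_t i = 0; i < count; i++)
            scratch[i] *= factor;                              /* compute locally */
        dma_copy(data + base, scratch, count * sizeof(float)); /* transfer out */
    }
}
```

The working set at any moment is exactly one tile, chosen by the program rather than by a cache replacement policy, which is why the algorithm has to be restructured around linear, tile-sized traversals.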
The disadvantage is that scratchpads require significantly different programming techniques. It is very hard to adapt programs written in traditional languages such as C and C++, which present the programmer with a uniform view of a large address space (an illusion simulated by caches). A traditional microprocessor can more easily run legacy code, which may then be accelerated by cache control instructions, whilst a scratchpad-based machine requires dedicated coding from the ground up to even function. Cache control instructions are specific to a certain cache line size, which in practice may vary between generations of processors in the same architectural family. Caches may also help coalesce reads and writes from less predictable access patterns (e.g., during texture mapping), whilst scratchpad DMA requires reworking algorithms for more predictable 'linear' traversals.
As such, scratchpads are generally harder to use with traditional programming models, although dataflow models (such as TensorFlow) might be more suitable.
Vector fetch
Vector processors (for example modern graphics processing units (GPUs) and Xeon Phi) use massive parallelism to achieve high throughput whilst working around memory latency, reducing the need for prefetching. Many read operations are issued in parallel for subsequent invocations of a compute kernel; calculations may be put on hold awaiting future data, whilst the execution units work on data from past requests that has already arrived. This is easier for programmers to leverage in conjunction with the appropriate programming models (compute kernels), but harder to apply to general-purpose programming.
The disadvantage is that many copies of temporary state may be held in the local memory of a processing element, awaiting data in flight.