In computing, a cache control instruction is a hint embedded in the instruction stream of a processor intended to improve the performance of hardware caches, using foreknowledge of the memory access pattern supplied by the programmer or compiler.
They may reduce cache pollution, reduce bandwidth requirements, and bypass latencies, by providing better control over the working set. Most cache control instructions do not affect the semantics of a program, although some can.
Examples
Several such instructions, with variants, are supported by processor instruction set architectures such as ARM, MIPS, PowerPC, and x86.
Prefetch
Also termed ''data cache block touch'', the effect is to request loading the cache line associated with a given address. This is performed by the PREFETCH instruction in the x86 instruction set. Some variants bypass higher levels of the cache hierarchy, which is useful in a 'streaming' context for data that is traversed once rather than held in the working set. The prefetch should occur sufficiently far ahead in time to mitigate the latency of memory access, for example in a loop traversing memory linearly. The GNU Compiler Collection intrinsic function __builtin_prefetch can be used to invoke this in the programming languages C or C++.
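A minimal sketch of how this might look when traversing an array in C, assuming GCC or Clang; the prefetch distance of 16 elements and the locality argument are illustrative assumptions rather than prescribed values:

    #include <stddef.h>

    /* Sum an array while hinting the cache a fixed distance ahead of the
       element currently being read. */
    #define PREFETCH_DISTANCE 16

    double sum_with_prefetch(const double *data, size_t n)
    {
        double total = 0.0;
        for (size_t i = 0; i < n; i++) {
            if (i + PREFETCH_DISTANCE < n)
                /* arguments: address, read (0) vs write (1), temporal locality 0-3 */
                __builtin_prefetch(&data[i + PREFETCH_DISTANCE], 0, 1);
            total += data[i];
        }
        return total;
    }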
Instruction prefetch
A variant of prefetch for the instruction cache.
Data cache block allocate zero
This hint is used to prepare cache lines before overwriting the contents completely. In this case, the CPU needn't load anything from main memory. The semantic effect is equivalent to an aligned memset of a cache-line sized block to zero, but the operation is effectively free.
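On PowerPC this corresponds to the dcbz instruction; the following is a rough sketch using GCC extended inline assembly, assuming the target's cache block size really is 128 bytes and that the buffer is cache-line aligned (otherwise neighbouring data sharing a block would also be zeroed):

    /* Zero a buffer one cache block at a time without reading it from RAM.
       CACHE_LINE is an assumption; real code should query the actual block size. */
    #define CACHE_LINE 128

    static inline void dcbz_line(void *p)
    {
        __asm__ volatile ("dcbz 0,%0" : : "r" (p) : "memory");
    }

    void zero_buffer(char *buf, unsigned long bytes)
    {
        for (unsigned long off = 0; off < bytes; off += CACHE_LINE)
            dcbz_line(buf + off);
    }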
Data cache block invalidate
This hint is used to discard cache lines, without committing their contents to main memory. Care is needed since incorrect results are possible. Unlike other cache hints, the semantics of the program are significantly modified. This is used in conjunction with
allocate zero
for managing temporary data. This saves main memory bandwidth and avoids cache pollution.
Data cache block flush
This hint requests the immediate eviction of a cache line, making way for future allocations. It is used when it is known that data is no longer part of the working set.
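On x86 a comparable effect can be obtained with the CLFLUSH instruction, exposed through the SSE2 intrinsic _mm_clflush; a minimal sketch, where the 64-byte line size and the trailing fence are assumptions for illustration:

    #include <emmintrin.h>   /* _mm_clflush, _mm_mfence (SSE2) */
    #include <stddef.h>

    #define CACHE_LINE 64    /* assumed; real code should query the line size */

    /* Flush a buffer that is known to have left the working set, freeing
       its cache lines for other data. */
    void flush_buffer(const char *buf, size_t bytes)
    {
        for (size_t off = 0; off < bytes; off += CACHE_LINE)
            _mm_clflush(buf + off);
        _mm_mfence();        /* order the flushes against later memory accesses */
    }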
Other hints
Some processors support a variant of load–store instructions that also imply cache hints. An example is ''load last'' in the PowerPC instruction set, which suggests that data will only be used once, i.e., the cache line in question may be pushed to the head of the eviction queue, whilst keeping it in use if still directly needed.
Alternatives
Automatic prefetch
In recent times, cache control instructions have become less popular as increasingly advanced application processor designs from Intel and ARM devote more transistors to accelerating code written in traditional languages, e.g., performing automatic prefetch, with hardware to detect linear access patterns on the fly. However, the techniques may remain valid for throughput-oriented processors, which have a different throughput-versus-latency tradeoff and may prefer to devote more area to execution units.
Scratchpad memory
Some processors support scratchpad memory, into which temporaries may be put, and direct memory access (DMA) to transfer data to and from main memory when needed. This approach is used by the Cell processor and some embedded systems. These allow greater control over memory traffic and locality (as the working set is managed by explicit transfers), and eliminate the need for expensive cache coherency in a manycore machine.
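As a conceptual illustration only, a double-buffered streaming loop might look as follows; dma_start_read, dma_wait, process and the scratchpad array are hypothetical stand-ins for a platform's real local-store and DMA API (such as that of the Cell SPEs), not functions from any standard library:

    #define TILE 1024

    /* Hypothetical platform primitives. */
    extern void dma_start_read(void *scratch_dst, const void *main_src,
                               unsigned bytes, int tag);
    extern void dma_wait(int tag);
    extern void process(float *tile, unsigned count);

    static float scratchpad[2][TILE];   /* lives in fast local memory */

    void stream_process(const float *main_mem, unsigned ntiles)
    {
        /* Start fetching the first tile, then overlap each DMA transfer
           with computation on the previously fetched tile. */
        dma_start_read(scratchpad[0], main_mem, sizeof scratchpad[0], 0);
        for (unsigned t = 0; t < ntiles; t++) {
            unsigned cur = t & 1, nxt = cur ^ 1;
            if (t + 1 < ntiles)
                dma_start_read(scratchpad[nxt], main_mem + (t + 1) * TILE,
                               sizeof scratchpad[nxt], (int)nxt);
            dma_wait((int)cur);              /* wait only for the tile in use */
            process(scratchpad[cur], TILE);
        }
    }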
The disadvantage is that it requires significantly different programming techniques to use. It is very hard to adapt programs written in traditional languages such as C and C++, which present the programmer with a uniform view of a large address space (an illusion simulated by caches). A traditional microprocessor can more easily run legacy code, which may then be accelerated by cache control instructions, whilst a scratchpad-based machine requires dedicated coding from the ground up to even function. Cache control instructions are specific to a certain cache line size, which in practice may vary between generations of processors in the same architectural family. Caches may also help coalesce reads and writes from less predictable access patterns (e.g., during texture mapping), whilst scratchpad DMA requires reworking algorithms for more predictable 'linear' traversals.
As such, scratchpads are generally harder to use with traditional programming models, although dataflow models (such as TensorFlow) might be more suitable.
Vector fetch
Vector processors (for example, modern graphics processing units (GPUs) and Xeon Phi) use massive parallelism to achieve high throughput whilst working around memory latency (reducing the need for prefetching). Many read operations are issued in parallel, for subsequent invocations of a compute kernel; calculations may be put on hold awaiting future data, whilst the execution units are devoted to working on data from past requests that has already arrived. This is easier for programmers to leverage in conjunction with the appropriate programming models (compute kernels), but harder to apply to general-purpose programming.
The disadvantage is that many copies of temporary state may be held in the local memory of a processing element, awaiting data in flight.