In computing, a cache control instruction is a hint embedded in the instruction stream of a processor, intended to improve the performance of hardware caches using foreknowledge of the memory access pattern supplied by the programmer or compiler. Such instructions may reduce cache pollution, reduce bandwidth requirements, and bypass latencies by providing better control over the working set. Most cache control instructions do not affect the semantics of a program, although some can.

Examples

Several such instructions, with variants, are supported by several processor instruction set architectures, such as ARM, MIPS, PowerPC, and x86.

Prefetch

Also termed 'data cache block touch', this hint requests loading the cache line associated with a given address. It is performed by the PREFETCH instruction in the x86 instruction set. Some variants bypass higher levels of the cache hierarchy, which is useful in a 'streaming' context for data that is traversed once rather than held in the working set. The prefetch should occur sufficiently far ahead in time to mitigate the latency of memory access, for example in a loop traversing memory linearly. The GNU Compiler Collection intrinsic function __builtin_prefetch can be used to invoke this in the programming languages C or C++.
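As a minimal sketch of the loop case described above, the following C function sums an array while prefetching a fixed distance ahead. The function name `sum_with_prefetch` and the constant `PREFETCH_DISTANCE` are hypothetical, and the distance of 16 elements is an assumption rather than a tuned value; a suitable distance depends on memory latency and on the work done per iteration.

```c
#include <stddef.h>

/* Hypothetical tuning constant: how many elements ahead to prefetch.
   The right value depends on memory latency and per-iteration work. */
#define PREFETCH_DISTANCE 16

/* Sum an array while hinting that data[i + PREFETCH_DISTANCE] is needed soon. */
long sum_with_prefetch(const int *data, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++) {
#if defined(__GNUC__)
        if (i + PREFETCH_DISTANCE < n) {
            /* Second argument 0: prefetch for reading.
               Third argument 0: no temporal locality, i.e. streamed data
               that need not stay in the cache after use. */
            __builtin_prefetch(&data[i + PREFETCH_DISTANCE], 0, 0);
        }
#endif
        total += data[i];
    }
    return total;
}
```

The hint does not change the result; the function returns the same sum with or without it. Issuing one prefetch per element is redundant, since neighbouring elements share a cache line, so a tuned version would likely prefetch once per line.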

Instruction prefetch

A variant of prefetch for the instruction cache.

Data cache block allocate zero

This hint is used to prepare cache lines before overwriting the contents completely. In this case, the CPU need not load anything from main memory. The semantic effect is equivalent to an aligned memset of a cache-line-sized block to zero, but the operation is effectively free.
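The memset equivalence can be written out as a portable model. This sketch reproduces only the visible result of the hint, not its performance benefit; the name `block_allocate_zero` and the 64-byte `CACHE_LINE_SIZE` are assumptions for illustration, and real line sizes vary.

```c
#include <stdint.h>
#include <string.h>

/* Assumed line size for illustration; real hardware may differ. */
#define CACHE_LINE_SIZE 64

/* Portable model of the architectural effect of "allocate zero":
   the whole cache line containing addr becomes zero.
   Real hardware claims the line in the cache without reading it from
   main memory; this fallback only reproduces the resulting contents. */
void block_allocate_zero(void *addr)
{
    /* Round the address down to the start of its cache line. */
    uintptr_t line = (uintptr_t)addr & ~(uintptr_t)(CACHE_LINE_SIZE - 1);
    memset((void *)line, 0, CACHE_LINE_SIZE);
}
```

Because the entire line is zeroed, the caller must own every byte of the aligned block, not just the address passed in.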

Data cache block invalidate

This hint is used to discard cache lines without committing their contents to main memory. Care is needed, since incorrect results are possible: unlike other cache hints, it significantly modifies the semantics of the program. It is used in conjunction with allocate zero for managing temporary data, which saves main memory bandwidth and avoids cache pollution.

Data cache block flush

This hint requests the immediate eviction of a cache line, making way for future allocations. It is used when it is known that data is no longer part of the working set.

Other hints

Some processors support variants of load–store instructions that also imply cache hints. An example is a 'load last' instruction, which suggests that the data will only be used once; the cache line in question may be pushed to the head of the eviction queue, whilst being kept in use if still directly needed.
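A related store-side hint can be sketched with the x86 SSE2 non-temporal store intrinsic `_mm_stream_si32`. This is not the load last instruction mentioned above but a different instruction chosen for illustration: it marks written data as unlikely to be reused soon. The function name `fill_streaming` and the macro `HAVE_STREAMING_STORE` are hypothetical, and on other targets the sketch falls back to ordinary stores.

```c
#include <stddef.h>

#if defined(__x86_64__) || defined(_M_X64)
#include <emmintrin.h>   /* _mm_stream_si32; also provides _mm_sfence */
#define HAVE_STREAMING_STORE 1
#endif

/* Fill dst with value, hinting that the written data will not be reused soon. */
void fill_streaming(int *dst, size_t n, int value)
{
#ifdef HAVE_STREAMING_STORE
    for (size_t i = 0; i < n; i++) {
        _mm_stream_si32(&dst[i], value);   /* non-temporal store */
    }
    _mm_sfence();   /* order the streaming stores before later stores */
#else
    /* Fallback for other targets: ordinary stores, same visible result. */
    for (size_t i = 0; i < n; i++) {
        dst[i] = value;
    }
#endif
}
```

Both paths leave the buffer with the same contents; only the cache behaviour differs.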

Alternatives

Automatic prefetch

In recent times, cache control instructions have become less popular as increasingly advanced application processor designs from Intel and ARM devote more transistors to accelerating code written in traditional languages, e.g., performing automatic prefetch, with hardware to detect linear access patterns on the fly. However, the techniques may remain valid for throughput-oriented processors, which have a different throughput vs latency tradeoff and may prefer to devote more area to execution units.
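The idea of detecting linear access patterns on the fly can be illustrated with a toy software model of a stride detector. This is a sketch only and does not describe how any particular Intel or ARM prefetcher works; the type `StrideDetector`, the function `stride_observe`, and the threshold of two repeated strides are all hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical threshold: repeats of the same stride needed before predicting. */
#define CONFIDENCE_THRESHOLD 2

typedef struct {
    uintptr_t last_addr;  /* most recent address observed */
    ptrdiff_t stride;     /* most recent difference between addresses */
    int confidence;       /* consecutive times the stride has repeated */
    int seen;             /* nonzero once at least one address was observed */
} StrideDetector;

/* Record one memory access. Returns 1 and writes the predicted next address
   to *predicted once a constant nonzero stride has repeated often enough;
   returns 0 otherwise. */
int stride_observe(StrideDetector *d, uintptr_t addr, uintptr_t *predicted)
{
    if (!d->seen) {
        /* First access: nothing to compare against yet. */
        d->seen = 1;
        d->last_addr = addr;
        return 0;
    }
    ptrdiff_t s = (ptrdiff_t)(addr - d->last_addr);
    if (s != 0 && s == d->stride) {
        if (d->confidence < CONFIDENCE_THRESHOLD) {
            d->confidence++;
        }
    } else {
        /* Pattern broken: remember the new stride and start over. */
        d->stride = s;
        d->confidence = 0;
    }
    d->last_addr = addr;
    if (d->confidence >= CONFIDENCE_THRESHOLD) {
        *predicted = addr + (uintptr_t)d->stride;
        return 1;
    }
    return 0;
}
```

A detector starts zero-initialised (`StrideDetector d = {0};`). Feeding it addresses 64 bytes apart yields a prediction on the fourth access, and an irregular address resets it.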

Scratchpad memory

Some processors support scratchpad memory into which temporaries may be put, and direct memory access (DMA) to transfer data to and from main memory when needed. This approach is used by the Cell processor and some embedded systems. These allow greater control over memory traffic and locality (as the working set is managed by explicit transfers), and eliminate the need for expensive cache coherency in a manycore machine.

The disadvantage is that scratchpads require significantly different programming techniques. It is very hard to adapt programs written in traditional languages such as C and C++, which present the programmer with a uniform view of a large address space (an illusion simulated by caches). A traditional microprocessor can more easily run legacy code, which may then be accelerated by cache control instructions, whilst a scratchpad-based machine requires dedicated coding from the ground up to even function. Cache control instructions are specific to a certain cache line size, which in practice may vary between generations of processors in the same architectural family. Caches may also help coalesce reads and writes from less predictable access patterns (e.g., during texture mapping), whilst scratchpad DMA requires reworking algorithms for more predictable 'linear' traversals.

As such, scratchpads are generally harder to use with traditional programming models, although dataflow models (such as TensorFlow) might be more suitable.
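The explicit-transfer style that scratchpads require can be sketched in portable C, with `memcpy` standing in for DMA. The names `dma_get`, `dma_put`, `scale_in_tiles`, and `TILE_ELEMS` are hypothetical and do not correspond to the Cell processor's API or any other real one. On real hardware the transfers would typically be asynchronous and overlapped with computation.

```c
#include <stddef.h>
#include <string.h>

#define TILE_ELEMS 256   /* hypothetical scratchpad capacity, in elements */

/* Stand-ins for DMA transfers; a real scratchpad machine would issue
   DMA commands here instead of calling memcpy. */
static void dma_get(float *local, const float *main_mem, size_t count)
{
    memcpy(local, main_mem, count * sizeof *local);
}

static void dma_put(float *main_mem, const float *local, size_t count)
{
    memcpy(main_mem, local, count * sizeof *local);
}

/* Scale every element of a large array, one tile at a time. */
void scale_in_tiles(float *data, size_t n, float factor)
{
    static float scratchpad[TILE_ELEMS];  /* models the on-chip local store */

    for (size_t base = 0; base < n; base += TILE_ELEMS) {
        size_t count = (n - base < TILE_ELEMS) ? n - base : TILE_ELEMS;

        dma_get(scratchpad, data + base, count);   /* main memory -> scratchpad */
        for (size_t i = 0; i < count; i++) {
            scratchpad[i] *= factor;               /* compute on local data only */
        }
        dma_put(data + base, scratchpad, count);   /* scratchpad -> main memory */
    }
}
```

The working set is exactly one tile, chosen by the programmer rather than by a cache replacement policy, and the traversal has to be restructured into linear tiles for this to work.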

Vector fetch

Vector processors (for example modern graphics processing units (GPUs) and Xeon Phi) use massive parallelism to achieve high throughput whilst working around memory latency (reducing the need for prefetching). Many read operations are issued in parallel, for subsequent invocations of a compute kernel; calculations may be put on hold awaiting future data, whilst the execution units are devoted to working on data from past requests that has already arrived. This is easier for programmers to leverage in conjunction with the appropriate programming models (compute kernels), but harder to apply to general purpose programming. The disadvantage is that many copies of temporary states may be held in the local memory of a processing element, awaiting data in flight.
