Spatial architecture

In computer science, spatial architectures are a kind of computer architecture leveraging many collectively coordinated and directly communicating processing elements (PEs) to quickly and efficiently run highly parallelizable kernels. The term "spatial" comes from the processing element instances being typically arranged in an array or grid, both logically and in the silicon design. Their most common workloads consist of matrix multiplications, convolutions, or, in general, tensor contractions. As such, spatial architectures are often used in AI accelerators. The key goal of a spatial architecture is to reduce the latency and power consumption of running very large kernels through the exploitation of scalable parallelism and data reuse. Consider a kernel, i.e. a function to be applied to several inputs, expressed as one or more loops; running it on a spatial architecture means distributing its computations between processing elements while ensuring that their data dependencies land either within the same element or the same region of elements. While spatial architectures can be designed or programmed to support different algorithms, each workload must then be mapped onto the processing elements using specialized dataflows. Formulating a mapping involves the assignment of each operation to a processing element and the scheduling of the ensuing data movements, all tuned to maximize data parallelism and reuse. Spatial architectures are classifiable as an SPMD (single program, multiple data, or single function multiple data) array processor, in that each processing element runs the same operations on a different subset of data, yet they are still programmed through a single mapping. The architecture of an individual processing element can then itself belong to any Flynn class. In particular, spatial architectures are well suited for applications whose dataflow exhibits producer-consumer relationships (e.g., parallel reduce) or can leverage efficient data sharing among a region of PEs. Spatial architectures can typically be found as hardware accelerators in heterogeneous systems, under the broader category of manycore processors.


Design details

The core element of a spatial architecture is its multidimensional array of processing elements. Each processing element is simple, typically a multiply-and-accumulate functional unit, a stripped-down CPU core, or application-specific logic. Processing elements are connected with each other and the memory hierarchy through buses, a network on chip, or even asynchronous logic. The memory hierarchy is explicitly managed and may consist of multiple on-chip buffers, like register files, scratchpads, and FIFOs, backed by large off-chip DRAM and non-volatile memories. The number of processing elements, interconnect bandwidth, and amount of on-chip memory vary widely between designs and target applications, ranging from thousands of processing elements and tens of megabytes of memory for high-performance computing down to tens of elements and a few kilobytes for edge devices. The key performance metrics for a spatial architecture are its consumed energy and latency when running a given workload. Due to technology and bandwidth limitations, the energy and latency required to access larger memories, like DRAM, dominate those of computation, often being hundreds of times higher than what is needed for storage near the processing elements. For this reason, a spatial architecture's memory hierarchy is intended to localize most repeated value accesses on faster and more efficient on-chip memories, exploiting data reuse to minimize costly accesses.
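The impact of this disparity can be illustrated with a back-of-the-envelope energy model. The following Python sketch compares a kernel that fetches every operand from DRAM against one that reuses data on-chip; the per-access energy figures and access counts are hypothetical placeholders, and only their relative orders of magnitude reflect a typical hierarchy.

 # Illustrative sketch: why data reuse matters energetically.
 # The per-access energies below are hypothetical; real values depend on
 # the technology node and memory sizes.
 ENERGY_PER_ACCESS_PJ = {
     "register_file": 1.0,    # small storage next to each PE
     "scratchpad":    10.0,   # shared on-chip buffer
     "dram":          200.0,  # off-chip memory
 }

 def total_energy_pj(access_counts):
     """Sum energy over the hierarchy given access counts per level."""
     return sum(ENERGY_PER_ACCESS_PJ[level] * n
                for level, n in access_counts.items())

 # Fetching all 3 million operands of a 1M-MAC kernel straight from DRAM...
 no_reuse = total_energy_pj({"dram": 3_000_000})
 # ...versus fetching each value once and reusing it on-chip afterwards.
 with_reuse = total_energy_pj({"dram": 30_000, "scratchpad": 300_000,
                               "register_file": 3_000_000})
 print(f"no reuse: {no_reuse/1e6:.0f} uJ, with reuse: {with_reuse/1e6:.0f} uJ")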


Data reuse

The mechanisms that enable reuse in spatial architectures are multicast and reduction. Reuse can be further classified as spatial and temporal. A spatial architecture's interconnect can support spatial multicast, where a single read from an outer memory serves multiple writes to inner memory instances, and spatial reduction, where reads from inner memories are accumulated into a single write to an outer memory. These can be implemented either with direct element-to-element forwarding, as in systolic arrays, or on the interconnect during memory accesses. Temporal reuse occurs when the same value is retained in a memory while being read (multicast) and/or updated in place (reduction) multiple times, without being re-fetched from another memory. Consider the case of kernels that can be computed with parallel ALU-like processing elements, such as matrix multiplications and convolutions. Direct inter-processing-element communication can be used effectively for passing partial sums to achieve spatially distributed accumulation, or for sharing the same input data for parallel computation without repeated accesses to outer memories. The amount of data reuse that can be exploited is a property of the kernel being run, and can be inferred by analyzing its data dependencies. When the kernel can be expressed as a loop nest, reuse arises from subsequent iterations accessing, in part, the same values. This overlap is a form of access locality and constitutes a reuse opportunity for spatial architectures, often called "stationarity". For kernels presenting affine transformations of indices, like convolutions and, more generally, stencil patterns, the partial overlap arising from the sliding window of the computation also yields a reuse opportunity, known as the "ghost zone" or "halo".
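The sliding-window overlap can be made concrete by counting how many times a naive loop nest touches each input element; every access beyond the first is a reuse opportunity. The sketch below uses a toy 1-D convolution for illustration (not the dataflow of any particular accelerator).

 # Sketch: quantifying the reuse opportunity of a sliding-window kernel.
 # A 1-D convolution with filter size R reads each interior input element
 # R times; the overlap between consecutive windows (the "halo") is what a
 # spatial architecture can keep on-chip instead of re-fetching.
 from collections import Counter

 def input_reads_1d_conv(output_size, filter_size):
     """Count reads of each input index In[p + r] in the naive loop nest."""
     reads = Counter()
     for p in range(output_size):       # output positions
         for r in range(filter_size):   # filter taps
             reads[p + r] += 1          # Out[p] += W[r] * In[p + r]
     return reads

 print(dict(input_reads_1d_conv(output_size=8, filter_size=3)))
 # Interior elements are read 3 times each: two of those reads can be
 # served from data already held near the processing elements.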
Naturally, spatial architectures are more effective the more reuse opportunities are present. At the same time, limited hardware resources mean that not all opportunities can be leveraged at once, requiring proper planning of the computation to exploit the most effective ones.


Mapping computations

To run a kernel on a spatial architecture, a mapping must be constructed detailing how the execution will unfold. Mapping a workload to a spatial architecture requires binding each of its computations to a processing element and then scheduling both the computations and the data movements required to support them. A good choice of mapping is crucial to maximizing performance. The starting point for a mapping is the loop nest representation of a kernel. To leverage parallelism and data reuse simultaneously, iterations must be divided between processing elements while ensuring that inter-iteration data dependencies are handled by the same element or neighborhood of elements. For illustrative purposes, the following mapping example focuses on a matrix multiplication, but everything generalizes to any data-parallel kernel. This choice stems from the fact that most work on spatial architectures focuses on neural network support and related optimizations. Note that a similar parallelization of matrix multiplication has also been discussed in the context of multiprocessor systems.

All mappings can be constructed through three loop transformations of the original kernel:
* Loop tiling (or blocking): the operands are split into smaller and smaller block matrices that can fit in the inner memories. This is repeated for every memory in the hierarchy, creating for each one a nested copy of the original loops. At any moment during execution, a memory only needs to store the data required for the iterations of its copy of the loops and of the inner ones.
* Parallelization: similar to tiling, but different tiles are processed concurrently across multiple processing elements rather than sequentially.
* Computation ordering: loops can be arbitrarily reordered inside each copy of the original loop nest. This alters data access patterns, changing which values are used in consecutive iterations and when.

Each memory hierarchy level is in charge of progressing through its assigned iterations. After parallelization, each processing element runs the same inner loops on partially different data. The exploited data reuse is implicit in this formalism: tiling enables temporal reuse, parallelization enables spatial reuse, and the order of computation determines which values can actually undergo reuse.

Consider, for this example, a spatial architecture comprising a large scratchpad storing the operands in their entirety and an array of processing elements, each with a small register file and a multiply-and-accumulate unit. Then, let the original kernel be this matrix multiplication:

 M, K, N = 128, 64, 256
 for m in [0:M):
  for k in [0:K):
   for n in [0:N):
    Out[m][n] += W[m][k] * In[k][n]

A complete mapping, with a tiled and parallelized kernel and a fully specified dataflow, can be written as follows. Iterations that are distributed across processing elements to run concurrently are marked with pfor:

 # ============================== outer memory ==============================
 for m_mem in [0:8):
  for k_mem in [0:1):
   for n_mem in [0:16):
 # ======================= across processing elements =======================
    pfor m_par in [0:16):
     pfor k_par in [0:16):
 # ======================= inside processing elements =======================
      for n_pe in [0:16):
       for m_pe in [0:1):
        for k_pe in [0:4):
         m = m_mem*16 + m_par + m_pe
         k = k_mem*16*4 + k_par*4 + k_pe
         n = n_mem*16 + n_pe
         Out[m][n] += W[m][k] * In[k][n]
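The mapping above can be checked mechanically. Below is a minimal Python sketch (using NumPy, with the loop bounds hard-coded to match the example) that verifies the tiled and reordered loop nest computes the same result as the original kernel; the pfor loops are simply executed sequentially here, since functional correctness does not depend on whether they actually run in parallel.

 import numpy as np

 # Problem sizes from the example kernel.
 M, K, N = 128, 64, 256
 rng = np.random.default_rng(0)
 W = rng.standard_normal((M, K))
 In = rng.standard_normal((K, N))

 # Reference result: the original, untiled loop nest is just Out = W @ In.
 Out_ref = W @ In

 # The mapped loop nest. The pfor loops are written as ordinary Python loops:
 # on the real array they would run concurrently, computing the same values.
 Out = np.zeros((M, N))
 for m_mem in range(8):                      # outer memory
     for k_mem in range(1):
         for n_mem in range(16):
             for m_par in range(16):         # across processing elements (pfor)
                 for k_par in range(16):     # across processing elements (pfor)
                     for n_pe in range(16):  # inside each processing element
                         for m_pe in range(1):
                             for k_pe in range(4):
                                 m = m_mem*16 + m_par + m_pe
                                 k = k_mem*16*4 + k_par*4 + k_pe
                                 n = n_mem*16 + n_pe
                                 Out[m][n] += W[m][k] * In[k][n]

 assert np.allclose(Out, Out_ref)            # the mapping computes W @ In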
Temporal data reuse can be observed in the form of stationarity, with outputs accumulated in place during the k_pe iterations and weights remaining stationary during the n_mem ones. Spatial reuse occurs as the reduction of outputs over k_par and the multicast of inputs throughout m_par. Each processing element instead sees a unique tile of weights. When inferring the reuse opportunities exploited by a mapping, any loop with a single iteration can be ignored and, for each operand, only the loops affecting its dimensions need to be considered.
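This inference can be automated with a simple rule: an operand is reused across a loop whenever none of its indices depend on that loop's variable, temporally for sequential loops and spatially (multicast for inputs, reduction for outputs) for pfor loops. The sketch below applies this rule to the example mapping; it is a simplification that only lists opportunities, without attributing them to a specific memory level.

 # Sketch: inferring reuse opportunities from the mapping's loop structure.
 LOOPS = [                      # (name, trip count, is_parallel)
     ("m_mem", 8,  False), ("k_mem", 1, False), ("n_mem", 16, False),
     ("m_par", 16, True),  ("k_par", 16, True),
     ("n_pe",  16, False), ("m_pe",  1, False), ("k_pe",  4, False),
 ]
 OPERAND_DIMS = {"W": "mk", "In": "kn", "Out": "mn"}  # indices of each operand

 for name, trips, parallel in LOOPS:
     if trips == 1:
         continue                          # single-iteration loops are ignored
     dim = name[0]                         # the index dimension this loop walks
     for operand, dims in OPERAND_DIMS.items():
         if dim not in dims:               # operand untouched by this loop
             kind = "spatial" if parallel else "temporal"
             print(f"{operand}: {kind} reuse over {name} ({trips} iterations)")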


Mapping optimization

The number of achievable mappings varies depending on the target hardware, but it typically runs into the billions, due to the large number of possible combinations of the above transformations. As a result, finding the set of transformations yielding the highest data reuse, and thus the lowest latency and power consumption for a given spatial architecture and kernel, is a challenging optimization problem. Most spatial architecture designs have been developed together with a tailored mapping technique. The complexity of the problem has, however, motivated the development of dedicated mapping tools that work for a variety of spatial architectures and implement general heuristics that consistently find good mappings within reasonable time. Techniques successfully applied to the problem include:
* Pruned search: since many mappings yield identical behavior, they can be pruned as redundant (e.g., reorderings of a loop with a single iteration). A random search over the remaining candidates is then often able to find good mappings (a toy example is sketched after this list).
* Genetic algorithms have been used to iteratively improve an initial set of diverse random mappings by transplanting between them the most successful loop transformations.
* Simulated annealing also starts from a random pool of mappings, iteratively applying random transformations to them. Each transformation is then retained with a probability that is proportional to its performance improvement and inversely proportional to the elapsed time since the start of the exploration.
* Integer programming can be applied by reformulating the problem of assigning kernel iterations to different loops and reordering them as a generalized assignment problem, which can then be solved through dedicated tools, like Gurobi.
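A minimal sketch of the pruned random search approach is given below, under simplifying assumptions not taken from any specific tool: a single level of tiling, a hypothetical scratchpad capacity, and a coarse DRAM-traffic cost model. Real mapping tools explore far larger spaces with much more detailed cost models.

 # Toy mapper: pruned random search over scratchpad tile sizes (Tm, Tk, Tn)
 # for Out += W * In, minimizing a rough estimate of DRAM traffic.
 import random

 M, K, N = 128, 64, 256
 SCRATCHPAD_WORDS = 16 * 1024              # hypothetical on-chip capacity

 def divisors(x):
     return [d for d in range(1, x + 1) if x % d == 0]

 def fits(Tm, Tk, Tn):
     # One tile of each operand must be resident in the scratchpad at once.
     return Tm*Tk + Tk*Tn + Tm*Tn <= SCRATCHPAD_WORDS

 def dram_traffic(Tm, Tk, Tn):
     # Simplified model: each W tile is re-fetched once per n-tile, each In
     # tile once per m-tile, and Out partial sums are spilled and re-read
     # once per k-tile beyond the first.
     return (M*K * (N // Tn)
             + K*N * (M // Tm)
             + M*N * (2 * (K // Tk) - 1))

 # Prune infeasible tilings, then randomly sample the rest.
 candidates = [(Tm, Tk, Tn)
               for Tm in divisors(M) for Tk in divisors(K) for Tn in divisors(N)
               if fits(Tm, Tk, Tn)]
 sample = random.sample(candidates, min(200, len(candidates)))
 best = min(sample, key=lambda t: dram_traffic(*t))
 print("best tile sizes (Tm, Tk, Tn):", best,
       "estimated traffic:", dram_traffic(*best), "words")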


Examples


Spatial architecture platforms

ASICs (application-specific integrated circuits), i.e. fully custom hardware accelerator designs, are the most common form in which spatial architectures have been developed, mainly because ASICs mesh well with the efficiency goals of spatial architectures. FPGAs (field-programmable gate arrays) can be seen as fine-grained and highly flexible spatial architectures, and the same applies to CGRAs (coarse-grained reconfigurable arrays). However, neither is limited to following the spatial architecture paradigm, as they may, for example, be reconfigured to run almost arbitrary tasks. They should therefore be considered spatial architectures only when set up to operate as one; in fact, several spatial architecture designs have been developed for deployment on FPGAs. Systolic arrays are a form of spatial architecture, in that they employ a mesh of computing nodes with a programmable interconnect, allowing computations to unfold while data moves in lock-step from node to node. The computational flow graph of systolic arrays naturally aligns with the pfors of spatial architecture mappings. Asynchronous arrays of simple processors are a precursor of spatial architectures following the MIMD paradigm and targeted towards digital signal processing workloads. Dataflow architectures are also a forerunner of spatial architectures, as a general-purpose approach to exploiting parallelism across several functional units. They run a program by starting each computation as soon as its data dependencies are satisfied and the required hardware is available. Spatial architectures simplified this concept by targeting specific kernels: rather than driving execution based on data readiness, they statically use the kernel's data dependencies to define the whole architecture's dataflow prior to execution through a mapping.


Not spatial architectures

Digital signal processors are highly specialized processors with custom datapaths that perform many arithmetic operations quickly, concurrently, and repeatedly on a series of data samples. Despite commonalities in target kernels, a single digital signal processor is not a spatial architecture, as it lacks the inherent spatial parallelism of an array of processing elements. Nonetheless, digital signal processors can be found in FPGAs and CGRAs, where they may form part of a larger spatial architecture design instantiated there. Tensor Cores, present in Nvidia GPUs since the Volta series, do not classify as spatial architectures either, despite accelerating matrix multiplication, as they are hardwired functional units that do not expose spatial features by themselves. Likewise, a streaming multiprocessor containing multiple Tensor Cores is not a spatial architecture but an instance of SIMT (single instruction, multiple threads), since its control is shared across several GPU threads.


Emergent or unconventional spatial architectures

In-memory computing proposes to perform computations on data directly inside the memory it is stored in. Its goal is to improve a computation's efficiency and density by sparing costly data transfers and reusing the existing memory hardware. For instance, one operand of a matrix multiplication could be stored in memory while the other is gradually brought in, with the memory itself producing the final product. When each group of memory cells that performs a computation between the stored and an incoming operand, such as a multiplication, is viewed as a processing element, an in-memory-computing-capable memory bank can be seen as a spatial architecture with a predetermined dataflow, the bank's width and height forming the characteristic pfors. Cognitive computers developed as part of research on neuromorphic systems are instances of spatial architectures targeting the acceleration of spiking neural networks. Each of their processing elements is a core handling several neurons and their synapses: it receives spikes directed to its neurons from other cores, integrates them, and eventually propagates the spikes it produces. Cores are connected through a network on chip and usually operate asynchronously. Their mapping consists of assigning neurons to cores while minimizing the total distance traveled by spikes, which acts as a proxy for energy and latency.
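To make the neuromorphic mapping problem concrete, the sketch below implements a hypothetical greedy placer, not the algorithm of any particular chip: neurons are assigned one at a time to the core that minimizes the traffic-weighted Manhattan distance to their already-placed presynaptic neurons, subject to a per-core capacity.

 # Toy greedy placement of neurons onto a 2-D grid of neuromorphic cores.
 import itertools

 GRID, CAPACITY = 4, 8                      # 4x4 cores, 8 neurons per core
 cores = {c: [] for c in itertools.product(range(GRID), repeat=2)}
 placement = {}                             # neuron id -> core coordinates

 def cost(core, synapses):
     # synapses: (presynaptic neuron, spike rate) pairs; distance is the
     # Manhattan hop count on the core grid, a proxy for energy and latency.
     return sum(rate * (abs(core[0] - placement[pre][0]) +
                        abs(core[1] - placement[pre][1]))
                for pre, rate in synapses if pre in placement)

 def place(neuron, synapses):
     free = [c for c, members in cores.items() if len(members) < CAPACITY]
     best = min(free, key=lambda c: cost(c, synapses))
     cores[best].append(neuron)
     placement[neuron] = best
     return best

 # Example: a chain of 32 neurons, each driven by spikes from the previous one.
 for i in range(32):
     place(i, synapses=[(i - 1, 1.0)] if i > 0 else [])
 print(placement)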


Specific implementations

Produced or prototyped spatial architectures as coprocessors or independent accelerators:
* Eyeriss: a deep learning accelerator developed by MIT's CSAIL laboratory, in particular Vivienne Sze's team, and presented in 2016. It employs a 108 KB scratchpad and a grid of processing elements, each with a 0.5 KB register file. A successor, Eyeriss v2, has also been designed, implementing a hierarchical interconnect between processing elements to compensate for the lack of bandwidth in the original.
* DianNao: a family of deep learning accelerators developed at ICT that offers both edge- and high-performance-computing-oriented variants. Their base architecture uses reconfigurable arrays of multipliers, adders, and activation-specific functional units to parallelize most deep learning layers.
* Simba: an experimental multi-chip-module spatial architecture developed by Nvidia. Each chip has roughly 110 KB of memory and features 16 processing elements, each containing a vector multiply-and-accumulate unit capable of performing a dot product between 8-element vectors. Multiple chips can be installed in the same module.
* NVDLA: an open-source, parametric, unidimensional array of processing elements specialized for convolutions, developed by Nvidia.
* Tensor Processing Unit (TPU): developed by Google and internally deployed in its datacenters since 2015. Its first version employed a large systolic array capable of 92 TeraOps/second and a large 28 MB software-managed on-chip memory. Several subsequent versions have been developed with increasing capabilities.
* TrueNorth: a neuromorphic chip produced by IBM in 2014. It features 4096 cores, each capable of handling 256 simulated neurons and 64k synapses. It does not have a global clock, and its cores operate in an event-driven fashion using both synchronous and asynchronous logic.

Spatial architectures integrated into existing products or platforms:
* Gemmini: a systolic-array-based deep learning accelerator developed by UC Berkeley as part of their open-source RISC-V ecosystem. Its base configuration is an array with 512 KB of memory, intended to be controlled via a tightly coupled core.
* AI Engine: an accelerator developed by AMD and integrated in their Ryzen AI series of products. In it, each processing element is a SIMD-capable VLIW core, increasing the flexibility of the spatial architecture and enabling it to also exploit task parallelism.

Workloads demonstrated to run on these spatial architectures include AlexNet, ResNet, BERT, and scientific computing.


See also

* Digital signal processor
* Loop nest optimization
* Manycore processor
* Neural processing unit
* Polytope model
* Symmetric multiprocessing
* Systolic array
* Vision processing unit




External links


* Eyeriss project (MIT)
* [https://timeloop.csail.mit.edu/ Timeloop/Accelergy Overview]
* Nvidia Research: Simba