Computational RAM (C-RAM) is random-access memory with processing elements integrated on the same chip. This enables C-RAM to be used as a SIMD computer. It can also make more efficient use of the memory bandwidth available within a memory chip.
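
The SIMD view of C-RAM can be illustrated with a short software model. The sketch below is only an analogy, not a real C-RAM interface: it assumes a hypothetical array of processing elements, each holding one word of local memory, that all execute the same broadcast instruction, which is the kind of operation a C-RAM chip performs without sending the data over an external memory bus.

```c
/* Minimal sketch (not a real C-RAM API): simulates a SIMD array in which
 * each processing element (PE) holds one word of local memory and all PEs
 * execute the same instruction in lock step.  A real C-RAM would perform
 * the loop below inside the memory chip in response to a single broadcast
 * command, so the operands never cross the external memory bus. */
#include <stdint.h>
#include <stdio.h>

#define NUM_PES 1024                   /* hypothetical number of processing elements */

static uint32_t local_mem[NUM_PES];    /* one word of memory per PE */

/* Broadcast "add constant" to every PE; the loop models the lock-step PEs. */
static void cram_broadcast_add(uint32_t constant)
{
    for (int pe = 0; pe < NUM_PES; pe++)
        local_mem[pe] += constant;
}

int main(void)
{
    for (int pe = 0; pe < NUM_PES; pe++)
        local_mem[pe] = (uint32_t)pe;  /* fill the array */

    cram_broadcast_add(10);            /* one "instruction", NUM_PES operations */

    printf("PE 0 = %u, PE 1023 = %u\n",
           (unsigned)local_mem[0], (unsigned)local_mem[NUM_PES - 1]);
    return 0;
}
```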


Overview

The most influential implementations of computational RAM came from the Berkeley IRAM project. Vector IRAM (V-IRAM) combines DRAM with a vector processor integrated on the same chip. Christoforos E. Kozyrakis, Stylianos Perissakis, David Patterson, Thomas Anderson, et al., "Scalable Processors in the Billion-Transistor Era: IRAM", ''IEEE Computer'', 1997, notes that "Vector IRAM ... can operate as a parallel built-in self-test engine for the memory array, significantly reducing the DRAM testing time and the associated cost."
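
The parallel built-in self-test role mentioned in the quotation can be pictured with a march-style memory test. The following sketch is an illustration only, not the IRAM test engine: it writes a pattern over a region, reads it back, and repeats with the complementary pattern; on V-IRAM the vector unit could run such passes over many columns of the DRAM array at once rather than having an external tester drive one word at a time over the chip's pins.

```c
/* Rough illustration of a march-style memory self-test (not the IRAM BIST
 * engine).  Each pass writes a pattern to every word and then verifies it;
 * the complementary pattern catches cells stuck at either value. */
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

#define TEST_WORDS 4096                 /* hypothetical size of the region under test */

static uint32_t dram_words[TEST_WORDS];

static int march_test(uint32_t *mem, size_t words)
{
    const uint32_t patterns[2] = { 0x55555555u, 0xAAAAAAAAu };

    for (int p = 0; p < 2; p++) {
        for (size_t i = 0; i < words; i++)    /* write pass */
            mem[i] = patterns[p];
        for (size_t i = 0; i < words; i++)    /* read-back/verify pass */
            if (mem[i] != patterns[p])
                return -1;                    /* failing cell found */
    }
    return 0;                                 /* region passed */
}

int main(void)
{
    printf("memory test %s\n",
           march_test(dram_words, TEST_WORDS) == 0 ? "passed" : "failed");
    return 0;
}
```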
Reconfigurable Architecture DRAM (RADram) is DRAM with reconfigurable (FPGA) logic elements integrated on the same chip. Mark Oskin, Frederic T. Chong, and Timothy Sherwood, "Active Pages: A Computation Model for Intelligent Memory", 1998.
SimpleScalar simulations show that RADram (in a system with a conventional processor) can give orders of magnitude better performance on some problems than traditional DRAM (in a system with the same processor). Some embarrassingly parallel computational problems are already limited by the von Neumann bottleneck between the CPU and the DRAM. Some researchers expect that, for the same total cost, a machine built from computational RAM will run orders of magnitude faster than a traditional general-purpose computer on these kinds of problems.

As of 2011, the "DRAM process" (few layers, optimized for high capacitance) and the "CPU process" (optimized for high frequency, typically with twice as many BEOL layers as DRAM; since each additional layer reduces yield and increases manufacturing cost, such chips are relatively expensive per square millimeter compared to DRAM) are distinct enough that there are three approaches to computational RAM:

* Starting with a CPU-optimized process and a device that uses a lot of embedded SRAM, add an additional process step (making it even more expensive per square millimeter) so the embedded SRAM can be replaced with embedded DRAM (eDRAM), giving ≈3x area savings on the SRAM areas (and so lowering net cost per chip).
* Starting with a system that has a separate CPU chip and DRAM chip(s), add small amounts of "coprocessor" computational ability to the DRAM, working within the limits of the DRAM process and adding only a small amount of area, to do things that would otherwise be slowed down by the narrow bottleneck between CPU and DRAM: zero-fill selected areas of memory, copy large blocks of data from one location to another, find where (if anywhere) a given byte occurs in some block of data, etc. (a hypothetical command interface of this kind is sketched at the end of this section). The resulting system, an unchanged CPU chip plus "smart DRAM" chip(s), is at least as fast as the original system and potentially slightly lower in cost. The cost of the small amount of extra area is expected to be more than paid back in savings of expensive test time, since a "smart DRAM" has enough computational capability for a wafer full of DRAM to do most of its testing internally in parallel, rather than the traditional approach of fully testing one DRAM chip at a time with expensive external automatic test equipment (ATE).
* Starting with a DRAM-optimized process, tweak the process to make it slightly more like the "CPU process", and build a (relatively low-frequency, but low-power and very high-bandwidth) general-purpose CPU within the limits of that process.

Some CPUs designed to be built on a DRAM process technology (rather than a "CPU" or "logic" process technology specifically optimized for CPUs) include the Berkeley IRAM project, TOMI Technology and the AT&T DSP1.

Because a memory bus to off-chip memory has many times the capacitance of an on-chip memory bus, a system with separate DRAM and CPU chips can have several times the energy consumption of an IRAM system with the same computer performance.

Because computational DRAM is expected to run hotter than traditional DRAM, and increased chip temperatures result in faster charge leakage from the DRAM storage cells, computational DRAM is expected to require more frequent DRAM refresh.
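
A rough software model can make the "smart DRAM" coprocessor idea from the second approach above concrete. The command names and structure below are invented for illustration and do not correspond to any shipping product: the host would issue one command descriptor, and logic on the DRAM die would walk the rows itself, so the data never crosses the external bus.

```c
/* Hypothetical command set for a "smart DRAM" coprocessor (names invented
 * for illustration).  The host CPU issues one command descriptor; the logic
 * on the DRAM die performs the whole bulk operation locally.  Here the DRAM
 * is just a simulated byte array. */
#include <stdint.h>
#include <string.h>
#include <stddef.h>

enum smart_dram_op { OP_ZERO_FILL, OP_BLOCK_COPY, OP_FIND_BYTE };

struct smart_dram_cmd {
    enum smart_dram_op op;
    size_t dst, src, len;      /* offsets into the DRAM array and length in bytes */
    uint8_t needle;            /* byte to search for with OP_FIND_BYTE */
};

/* Returns the offset of the found byte for OP_FIND_BYTE (or -1), 0 otherwise. */
static long smart_dram_exec(uint8_t *dram, const struct smart_dram_cmd *c)
{
    switch (c->op) {
    case OP_ZERO_FILL:
        memset(dram + c->dst, 0, c->len);
        return 0;
    case OP_BLOCK_COPY:
        memmove(dram + c->dst, dram + c->src, c->len);
        return 0;
    case OP_FIND_BYTE:
        for (size_t i = 0; i < c->len; i++)
            if (dram[c->src + i] == c->needle)
                return (long)(c->src + i);
        return -1;
    }
    return -1;
}

int main(void)
{
    static uint8_t dram[4096];                               /* simulated DRAM, zeroed */
    dram[100] = 0x42;
    struct smart_dram_cmd find = { OP_FIND_BYTE, 0, 0, sizeof dram, 0x42 };
    return smart_dram_exec(dram, &find) == 100 ? 0 : 1;      /* expect offset 100 */
}
```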


Processor-in-/near-memory

A processor-in-/near-memory (PINM) is a computer processor (CPU) tightly coupled to memory, generally on the same silicon chip. The chief goal of merging the processing and memory components in this way is to reduce memory latency and increase bandwidth. In addition, reducing the distance that data needs to be moved reduces the power requirements of a system. Much of the complexity (and hence power consumption) of current processors stems from strategies for avoiding memory stalls.
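
The power argument can be made concrete with a back-of-the-envelope calculation. All numbers in the sketch below are placeholders rather than measured values; the point is only that total data-movement energy scales with the per-byte cost of the bus being crossed, which is much higher off-chip than on-chip.

```c
/* Back-of-the-envelope comparison of data-movement energy (all figures are
 * placeholders, not measurements).  Moving a byte across an off-chip memory
 * bus costs far more energy than moving it across on-chip wires, so placing
 * the processor next to the memory shrinks the movement term of total power. */
#include <stdio.h>

int main(void)
{
    const double bytes_moved     = 1e9;     /* 1 GB of memory traffic         */
    const double pj_per_byte_off = 100.0;   /* hypothetical off-chip cost, pJ */
    const double pj_per_byte_on  = 10.0;    /* hypothetical on-chip cost, pJ  */

    double off_chip_mj = bytes_moved * pj_per_byte_off * 1e-9;  /* pJ -> mJ */
    double on_chip_mj  = bytes_moved * pj_per_byte_on  * 1e-9;

    printf("off-chip: %.1f mJ, on-chip: %.1f mJ, ratio: %.0fx\n",
           off_chip_mj, on_chip_mj, off_chip_mj / on_chip_mj);
    return 0;
}
```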


Examples

In the 1980s, a tiny CPU that executed FORTH was fabricated into a DRAM chip to improve PUSH and POP. FORTH is a stack-oriented programming language, and this improved its efficiency. The transputer also had large on-chip memory for its time (the early 1980s), making it essentially a processor-in-memory. Notable PIM projects include the Berkeley IRAM project at the University of California, Berkeley, and the University of Notre Dame PIM effort.


DRAM-based PIM taxonomy

DRAM-based near-memory and in-memory designs can be categorized into four groups:

* DIMM-level approaches place the processing units near the memory chips. These approaches require minimal or no change in the data layout (e.g., Chameleon: Hadi Asghari-Moghaddam, et al., "Chameleon: Versatile and practical near-DRAM acceleration architecture for large memory systems"; and RecNMP: Liu Ke, et al., "RecNMP: Accelerating Personalized Recommendation with Near-Memory Processing").
* Logic-layer-level approaches embed processing units in the logic layer of 3D-stacked memories and can benefit from the high bandwidth of 3D-stacked memories (e.g., TOP-PIM: Dongping Zhang, et al., "TOP-PIM: Throughput-oriented programmable processing in memory").
* Bank-level approaches place processing units inside the memory layers, near each bank. UPMEM and Samsung's PIM (Sukhan Lee, et al., "Hardware Architecture and Software Stack for PIM Based on Commercial DRAM Technology: Industrial Product") are examples of these approaches.
* Subarray-level approaches process data inside each subarray. Subarray-level approaches provide the highest access parallelism but often perform only simple operations, such as bitwise operations on an entire memory row (e.g., DRISA: Shuangchen Li, et al., "DRISA: A DRAM-based reconfigurable in-situ accelerator") or sequential processing of the memory row using a single-word ALU (e.g., Fulcrum: Marzieh Lenjani, et al., "Fulcrum: a Simplified Control and Access Mechanism toward Flexible and Practical In-situ Accelerators"). A software analogy of such a row-wide bitwise operation is sketched below.
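
As a software analogy for the subarray-level approaches, the sketch below models a bulk bitwise operation applied to an entire memory row. It is not how DRISA or any other design is actually implemented in DRAM circuitry; the row width and function names are assumptions for illustration.

```c
/* Illustrative model of a subarray-level bulk bitwise operation (a software
 * analogy only, not a circuit-level description).  Two whole rows are ANDed
 * into a destination row in response to one "command", without the data
 * leaving the subarray. */
#include <stdint.h>
#include <stddef.h>

#define ROW_WORDS 1024   /* hypothetical row width: 1024 x 64-bit words = 8 KiB */

static void row_and(uint64_t dst[ROW_WORDS],
                    const uint64_t a[ROW_WORDS],
                    const uint64_t b[ROW_WORDS])
{
    for (size_t i = 0; i < ROW_WORDS; i++)
        dst[i] = a[i] & b[i];
}

int main(void)
{
    static uint64_t a[ROW_WORDS], b[ROW_WORDS], dst[ROW_WORDS];
    for (size_t i = 0; i < ROW_WORDS; i++) { a[i] = ~0ull; b[i] = i; }
    row_and(dst, a, b);               /* dst[i] == i, since row a is all ones */
    return dst[7] == 7 ? 0 : 1;
}
```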


See also

* Computing with Memory
* SyNAPSE, which also combines processing and memory in one chip


References


Bibliography

* Duncan Elliott, Michael Stumm, W. Martin Snelgrove, Christian Cojocaru, Robert McKenzie, "Computational RAM: Implementing Processors in Memory", ''IEEE Design and Test of Computers'', vol. 16, no. 1, pp. 32–41, Jan–Mar 1999. doi:10.1109/54.748803