The Cell Broadband Engine (Cell/B.E.) is a 64-bit
multi-core processor
and
microarchitecture developed by
Sony
,
Toshiba
, and
IBM
—an alliance known as "STI". It combines a general-purpose
PowerPC
core, called the Power Processing Element (PPE), with multiple specialized
coprocessors, known as Synergistic Processing Elements (SPEs), which accelerate tasks such as
multimedia
and
vector processing
.
The architecture was developed over a four-year period beginning in March 2001, with Sony reporting a development budget of approximately . Its first major commercial application was in Sony's
PlayStation 3
home video game console, released in 2006. In 2008, a modified version of the Cell processor powered IBM's
Roadrunner, the first supercomputer to sustain one
petaFLOPS
. Other applications include high-performance computing systems from
Mercury Computer Systems and specialized
arcade system boards.
Cell emphasizes
memory coherence, power efficiency, and peak
computational throughput, but its design presented significant challenges for software development. IBM offered a
Linux
-based
software development kit to facilitate programming on the platform.
History

In mid-2000, Sony, Toshiba, and IBM formed the STI alliance to develop a new microprocessor. The STI Design Center opened in March 2001 in
Austin, Texas
. Over the next four years, more than 400 engineers collaborated on the project, with IBM contributing from eleven of its design centers.
Initial
patents
described a configuration with four
Power Processing Elements (PPEs), each paired with eight Synergistic Processing Elements (SPEs), for a theoretical peak performance of 1 teraFLOPS. However, only a scaled-down design—one PPE with eight SPEs—was ultimately manufactured.
Fabrication of the initial Cell chip began on a
90 nm SOI (
silicon on insulator
) process.
In March 2007, IBM transitioned production to a
65 nm process,
followed by a
45 nm process announced in February 2008.
Bandai Namco Entertainment
used the Cell processor in its
Namco System 357 and 369 arcade boards.
In May 2008, IBM introduced the
PowerXCell 8i, a double-precision variant of the Cell processor, used in systems such as IBM's Roadrunner supercomputer, the first to achieve one petaFLOPS and the fastest until late 2009.
IBM ceased development of higher-core-count Cell variants (such as a 32-APU version) in late 2009,
but continued supporting existing Cell-based products.
Commercialization
On May 17, 2005, Sony confirmed the Cell configuration used in the
PlayStation 3
: one PPE and seven SPEs.
To improve manufacturing
yield, the processor is initially fabricated with eight SPEs. After production,
each chip is tested, and if a defect is found in one SPE, it is disabled using
laser trimming. This approach minimizes waste by utilizing processors that would otherwise be discarded. Even in chips without defects, one SPE is intentionally disabled to ensure consistency across units.
Of the seven operational SPEs, six are available for developers to use in games and applications, while the seventh is reserved for the console's operating system.
The chip operates at a clock speed of 3.2 GHz.
Sony also used the Cell in its
Zego high-performance media computing server.
The PPE supports
simultaneous multithreading (SMT) and can execute two threads, while each active SPE supports one thread. In the PlayStation 3 configuration, the Cell processor supports up to nine threads.
On June 28, 2005, IBM and Mercury Computer Systems announced a partnership to use Cell processors in
embedded systems
for
medical imaging
,
aerospace
, and
seismic processing, among other fields.
Mercury uses the full Cell processor with eight active SPEs. Mercury later released
blade servers and
PCI Express
accelerator cards based on the architecture.
In 2006, IBM introduced the QS20 blade server, offering up to 410 gigaFLOPS per module in single-precision performance. The
QS22 blade, based on the PowerXCell 8i, was used in IBM's Roadrunner supercomputer.
On April 8, 2008, Fixstars Corporation released a PCI Express accelerator board based on the PowerXCell 8i.
Overview
The Cell Broadband Engine, or ''Cell'' as it is more commonly known, is a microprocessor intended as a hybrid of conventional desktop processors (such as the
Athlon 64 and
Core 2 families) and more specialized high-performance processors, such as the
NVIDIA
and
ATI graphics processors (
GPUs). The longer name indicates its intended use, namely as a component in current and future
online distribution systems; as such it may be utilized in high-definition displays and recording equipment, as well as
HDTV
systems. Additionally, the processor may be suited to
digital imaging
systems (medical, scientific, ''etc.'') and
physical simulation (''e.g.'', scientific and
structural engineering
modeling). As used in the PlayStation 3, it has 250 million transistors.
In a simple analysis, the Cell processor can be split into four components: external input and output structures, the main processor called the ''Power Processing Element'' (PPE) (a two-way
simultaneous-multithreaded PowerPC 2.02 core), eight fully functional co-processors called the ''Synergistic Processing Elements'', or SPEs, and a specialized high-bandwidth
circular data bus connecting the PPE, input/output elements and the SPEs, called the ''Element Interconnect Bus'' or EIB.
To achieve the high performance needed for mathematically intensive tasks, such as decoding/encoding
MPEG
streams, generating or transforming three-dimensional data, or undertaking
Fourier analysis
of data, the Cell processor marries the SPEs and the PPE via EIB to give access, via fully
cache coherent DMA (direct memory access), to both main memory and to other external data storage. To make the best of EIB, and to overlap computation and data transfer, each of the nine processing elements (PPE and SPEs) is equipped with a
DMA engine. Since the SPE's load/store instructions can only access its own local
scratchpad memory, each SPE depends entirely on DMA to transfer data to and from main memory and other SPEs' local memories. A DMA operation can transfer either a single block of up to 16 KB or a list of 2 to 2048 such blocks. One of the major design decisions in the Cell architecture is the use of DMA as the central means of intra-chip data transfer, with a view to enabling maximal asynchrony and concurrency in data processing within the chip.
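As a rough illustration of the size limits described above, the following Python sketch splits a large transfer into blocks of at most 16 KB and groups them into lists of at most 2048 blocks. The function name `plan_dma` and the `(offset, size)` tuple layout are hypothetical, 16 KB is assumed to mean 16,384 bytes, and real hardware constraints such as alignment are not modelled.

```python
MAX_BLOCK_BYTES = 16 * 1024   # largest single DMA block (16 KB, assumed to be 16,384 bytes)
MAX_LIST_BLOCKS = 2048        # largest number of blocks in one DMA list


def plan_dma(total_bytes):
    """Split a transfer into DMA lists of (offset, size) blocks.

    Hypothetical helper for illustration only; it ignores alignment and
    every other hardware restriction.
    """
    if total_bytes < 0:
        raise ValueError("total_bytes must be non-negative")
    blocks = []
    offset = 0
    while offset < total_bytes:
        size = min(MAX_BLOCK_BYTES, total_bytes - offset)
        blocks.append((offset, size))
        offset += size
    # Group consecutive blocks into lists of at most MAX_LIST_BLOCKS entries.
    # A group holding a single block corresponds to a plain single-block transfer.
    return [blocks[i:i + MAX_LIST_BLOCKS]
            for i in range(0, len(blocks), MAX_LIST_BLOCKS)]
```

Under these assumptions, a 40,000-byte transfer becomes one list of three blocks (16,384 + 16,384 + 7,232 bytes), and a single list can describe at most 2048 × 16 KB of data.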
The PPE, which is capable of running a conventional operating system, has control over the SPEs and can start, stop, interrupt, and schedule processes running on the SPEs. To this end, the PPE has additional instructions relating to the control of the SPEs. Unlike SPEs, the PPE can read and write the main memory and the local memories of SPEs through the standard load/store instructions. The SPEs are not fully autonomous and require the PPE to prime them before they can do any useful work. As most of the "horsepower" of the system comes from the synergistic processing elements, the use of
DMA as a method of data transfer and the limited local
memory footprint
of each SPE pose a major challenge to software developers who wish to make the most of this horsepower, demanding careful hand-tuning of programs to extract maximal performance from this CPU.
The PPE and bus architecture includes various modes of operation, giving different levels of
memory protection, allowing areas of memory to be protected from access by specific processes running on the SPEs or the PPE.
Both the PPE and SPE are
RISC
architectures with a fixed-width 32-bit instruction format. The PPE contains a 64-bit
general-purpose register
set (GPR), a 64-bit floating-point register set (FPR), and a 128-bit
Altivec register set. The SPE contains 128-bit registers only. These can be used for scalar data types ranging from 8 bits to 64 bits in size, or for
SIMD
computations on various integer and floating-point formats. System memory addresses for both the PPE and SPE are expressed as 64-bit values. Local store addresses internal to the SPU (Synergistic Processor Unit) processor are expressed as a 32-bit word. In documentation relating to Cell, a word is always taken to mean 32 bits, a doubleword means 64 bits, and a quadword means 128 bits.
PowerXCell 8i
In 2008, IBM announced a revised variant of the Cell called the PowerXCell 8i,
which is available in QS22
Blade Servers from IBM. The PowerXCell is manufactured on a
65 nm process, and adds support for up to 32 GB of slotted DDR2 memory, as well as dramatically improving
double-precision floating-point performance on the SPEs from a peak of about 12.8
GFLOPS to 102.4 GFLOPS total for eight SPEs, which, coincidentally, is the same peak performance as the
NEC SX-9 vector processor released around the same time. The
IBM Roadrunner supercomputer, the world's fastest during 2008–2009, consisted of 12,240 PowerXCell 8i processors, along with 6,562
AMD Opteron processors.
PowerXCell 8i-powered supercomputers also held all of the top six places on the Green500 list of the "greenest" systems, which ranks supercomputers by their MFLOPS-per-watt ratio.
Besides the QS22 and supercomputers, the PowerXCell processor is also available as an accelerator on a PCI Express card and is used as the core processor in the
QPACE project.
Because the PowerXCell 8i removed the Rambus memory interface and added significantly larger DDR2 interfaces and enhanced SPEs, the chip layout had to be reworked, resulting in both a larger die and larger packaging.
Architecture

While the Cell chip can have a number of different configurations, the basic configuration is a
multi-core
chip composed of one "Power Processor Element" ("PPE") (sometimes called "Processing Element", or "PE"), and multiple "Synergistic Processing Elements" ("SPE").
The PPE and SPEs are linked together by an internal high speed bus dubbed "Element Interconnect Bus" ("EIB").
Power Processor Element (PPE)
The ''PPE''
is the
PowerPC
based, dual-issue in-order two-way
simultaneous-multithreaded CPU core with a 23-stage pipeline acting as the controller for the eight SPEs, which handle most of the computational workload. The PPE has limited out-of-order execution capabilities; it can perform loads out of order and has delayed execution pipelines. The PPE works with conventional operating systems due to its similarity to other 64-bit PowerPC processors, while the SPEs are designed for vectorized floating-point code execution. The PPE contains a 32
KiB level 1 instruction
cache, a 32 KiB level 1 data cache, and a 512 KiB level 2 cache. The size of a cache line is 128 bytes in all caches.
Additionally, IBM has included an AltiVec (VMX) unit, which is fully pipelined for single-precision floating point (AltiVec 1 does not support double-precision floating-point vectors), a 32-bit Fixed Point Unit (FXU) with a 64-bit register file per thread, a Load and Store Unit (LSU), a 64-bit Floating-Point Unit (FPU), a Branch Unit (BRU), and a Branch Execution Unit (BXU).
The PPE consists of three main units: the Instruction Unit (IU), the Execution Unit (XU), and the vector/scalar execution unit (VSU). The IU contains the L1 instruction cache, branch prediction hardware, instruction buffers, and dependency-checking logic. The XU contains the integer execution units (FXU) and the load-store unit (LSU). The VSU contains all of the execution resources for the FPU and VMX. Each PPE can complete two double-precision operations per clock cycle using a scalar fused-multiply-add instruction, which translates to 6.4
GFLOPS at 3.2 GHz; or eight single-precision operations per clock cycle with a vector fused-multiply-add instruction, which translates to 25.6 GFLOPS at 3.2 GHz.
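The peak figures above follow from multiplying operations per clock cycle by the clock rate. A minimal Python sketch of that arithmetic is shown below; the function name `peak_mflops` is illustrative, and the clock is expressed in MHz so that the calculation stays in exact integers.

```python
def peak_mflops(ops_per_clock, clock_mhz):
    """Peak throughput in MFLOPS: operations per cycle times cycles per microsecond."""
    return ops_per_clock * clock_mhz


CLOCK_MHZ = 3200  # 3.2 GHz

scalar_dp = peak_mflops(2, CLOCK_MHZ)  # scalar fused multiply-add, double precision
vector_sp = peak_mflops(8, CLOCK_MHZ)  # vector fused multiply-add, single precision

print(scalar_dp / 1000, vector_sp / 1000)  # values in GFLOPS; prints 6.4 25.6
```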
Xenon in Xbox 360
The PPE was designed specifically for the Cell processor, but during development,
Microsoft
approached IBM wanting a high-performance processor core for its
Xbox 360
. IBM complied and made the tri-core
Xenon processor, based on a slightly modified version of the PPE with added VMX128 extensions.
Synergistic Processing Element (SPE)
Each SPE is a dual-issue, in-order processor composed of a "Synergistic Processing Unit" (SPU) and a "Memory Flow Controller" (MFC), which comprises the DMA, MMU, and bus interface. SPEs do not have any branch prediction hardware, which places a heavy burden on the compiler.
Each SPE has six execution units divided between odd and even pipelines. The SPU runs a specially developed
instruction set
(ISA) with
128-bit
SIMD
organization
for single and double precision instructions. With the current generation of the Cell, each SPE contains a 256
KiB embedded SRAM for instruction and data, called
"Local Storage" (not to be mistaken for "Local Memory" in Sony's documents that refer to the VRAM) which is visible to the PPE and can be addressed directly by software. Each SPE can support up to 4
GiB of local store memory. The local store does not operate like a conventional
CPU cache
since it is neither transparent to software nor does it contain hardware structures that predict which data to load. Each SPE contains a 128-bit, 128-entry register file and measures 14.5 mm² on a 90 nm process. An SPE can operate on sixteen 8-bit integers, eight 16-bit integers, four 32-bit integers, or four single-precision floating-point numbers in a single clock cycle, as well as a memory operation. Note that the SPU cannot directly access system memory; the 64-bit virtual memory addresses formed by the SPU must be passed from the SPU to the SPE memory flow controller (MFC) to set up a DMA operation within the system address space.
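The element counts listed above are simply the 128-bit register width divided by the element width. The short Python sketch below makes that relationship explicit; the helper name `lanes` is hypothetical and is not part of any Cell toolchain.

```python
REGISTER_BITS = 128  # width of each SPE register


def lanes(element_bits):
    """Number of elements of the given width that fit in one 128-bit register."""
    if element_bits <= 0 or REGISTER_BITS % element_bits != 0:
        raise ValueError("element width must evenly divide the register width")
    return REGISTER_BITS // element_bits


for width in (8, 16, 32):
    print(width, "bit elements:", lanes(width))
```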
In one typical usage scenario, the system will load the SPEs with small programs (similar to
threads), chaining the SPEs together to handle each step in a complex operation. For instance, a
set-top box
might load programs for reading a DVD, video and audio decoding, and display, and the data would be passed off from SPE to SPE until finally ending up on the TV. Another possibility is to partition the input data set and have several SPEs performing the same kind of operation in parallel. At 3.2 GHz, each SPE gives a theoretical 25.6
GFLOPS of single-precision performance.
Compared to its
personal computer
contemporaries, the relatively high overall floating-point performance of a Cell processor seemingly dwarfs the abilities of the SIMD unit in CPUs like the
Pentium 4
and the
Athlon 64. However, comparing only floating-point abilities of a system is a one-dimensional and application-specific metric. Unlike a Cell processor, such desktop CPUs are more suited to the general-purpose software usually run on personal computers. In addition to executing multiple instructions per clock, processors from Intel and AMD feature
branch predictors. The Cell is designed to compensate for this with compiler assistance, in which prepare-to-branch instructions are created. For double-precision floating-point operations, as sometimes used in personal computers and often used in scientific computing, Cell performance drops by an order of magnitude, but still reaches 20.8 GFLOPS (1.8 GFLOPS per SPE, 6.4 GFLOPS per PPE). The PowerXCell 8i variant, which was specifically designed for double-precision, reaches 102.4 GFLOPS in double-precision calculations.
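The double-precision totals quoted above can be checked with simple integer arithmetic. The sketch below assumes eight SPEs plus one PPE, works in MFLOPS to avoid floating-point rounding, and uses an illustrative function name; the per-SPE value for the PowerXCell 8i is derived here by dividing the quoted 102.4 GFLOPS evenly across eight SPEs, which is an assumption rather than a figure stated in the text.

```python
def chip_peak_mflops(spe_count, spe_mflops, ppe_mflops):
    """Aggregate peak throughput: all SPEs plus the single PPE, in MFLOPS."""
    return spe_count * spe_mflops + ppe_mflops


# Original Cell, double precision: 1.8 GFLOPS per SPE and 6.4 GFLOPS for the PPE.
cell_dp_mflops = chip_peak_mflops(8, 1800, 6400)

# PowerXCell 8i: 102.4 GFLOPS quoted for the SPEs, split evenly (assumed) over eight.
powerxcell_spe_mflops = 102400 // 8

print(cell_dp_mflops / 1000)         # prints 20.8
print(powerxcell_spe_mflops / 1000)  # prints 12.8
```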
Tests by IBM show that the SPEs can reach 98% of their theoretical peak performance running optimized parallel matrix multiplication.
Toshiba
has developed a
co-processor powered by four SPEs, but no PPE, called the
SpursEngine designed to accelerate 3D and movie effects in consumer electronics.
Each SPE has a local memory of 256 KB. In total, the SPEs have 2 MB of local memory.
Element Interconnect Bus (EIB)
The EIB is a communication bus internal to the Cell processor which connects the various on-chip system elements: the PPE processor, the memory controller (MIC), the eight SPE coprocessors, and two off-chip I/O interfaces, for a total of 12 participants in the PS3 (the number of SPEs can vary in industrial applications). The EIB also includes an arbitration unit, which functions as a set of traffic lights. In some documents, IBM refers to EIB participants as 'units'.
The EIB is presently implemented as a circular ring consisting of four 16-byte-wide unidirectional channels that counter-rotate in pairs. When traffic patterns permit, each channel can convey up to three transactions concurrently. As the EIB runs at half the system clock rate, the effective channel rate is 16 bytes every two system clocks. At maximum
concurrency, with three active transactions on each of the four rings, the peak instantaneous EIB bandwidth is 96 bytes per clock (12 concurrent transactions × 16 bytes wide / 2 system clocks per transfer). While this figure is often quoted in IBM literature, it is unrealistic to simply scale this number by processor clock speed. The arbitration unit
imposes additional constraints.
IBM Senior Engineer
David Krolak, EIB lead designer, explains the concurrency model:
Each participant on the EIB has one 16-byte read port and one 16-byte write port. The limit for a single participant is to read and write at a rate of 16 bytes per EIB clock (for simplicity often regarded as 8 bytes per system clock). Each SPU processor contains a dedicated
DMA management queue capable of scheduling long sequences of transactions to various endpoints without interfering with the SPU's ongoing computations; these DMA queues can be managed locally or remotely as well, providing additional flexibility in the control model.
Data flows on an EIB channel stepwise around the ring. Since there are twelve participants, the total number of steps around the channel back to the point of origin is twelve. Six steps is the longest distance between any pair of participants. An EIB channel is not permitted to convey data requiring more than six steps; such data must take the shorter route around the circle in the other direction. The number of steps involved in sending the packet has very little impact on transfer latency: the clock speed driving the steps is very fast relative to other considerations. However, longer communication distances are detrimental to the overall performance of the EIB as they reduce available concurrency.
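The six-step limit follows from always taking the shorter way around a twelve-position ring. The Python sketch below illustrates that rule; the position numbers 0 to 11 are hypothetical labels, since the physical ordering of the participants on the ring is not given here.

```python
PARTICIPANTS = 12  # ring positions, numbered 0..11 purely for illustration


def ring_steps(src, dst, participants=PARTICIPANTS):
    """Number of steps along the shorter direction between two ring positions."""
    one_way = (dst - src) % participants
    return min(one_way, participants - one_way)


# The longest shortest path over all pairs of positions.
longest = max(ring_steps(a, b)
              for a in range(PARTICIPANTS)
              for b in range(PARTICIPANTS))
print(longest)  # prints 6
```

A transfer that would need seven steps in one direction needs only five in the other, which is why no channel ever has to carry data more than halfway around the ring.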
Despite IBM's original desire to implement the EIB as a more powerful cross-bar, the circular configuration they adopted to spare resources rarely represents a limiting factor on the performance of the Cell chip as a whole. In the worst case, the programmer must take extra care to schedule communication patterns where the EIB is able to function at high concurrency levels.
David Krolak explained:
Bandwidth assessment
At 3.2 GHz, each channel flows at a rate of 25.6 GB/s. Viewing the EIB in isolation from the system elements it connects, achieving twelve concurrent transactions at this flow rate works out to an abstract EIB bandwidth of 307.2 GB/s. Based on this view many IBM publications depict available EIB bandwidth as "greater than 300 GB/s". This number reflects the peak instantaneous EIB bandwidth scaled by processor frequency.
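The peak figures above can be reproduced from the ring parameters: four channels, up to three concurrent transactions per channel, 16 bytes per transfer, and one transfer every two system clocks. The Python sketch below is only a restatement of that arithmetic with illustrative names; it works in MB/s (taking 1 GB/s as 1000 MB/s) so that all values are exact integers, and it deliberately ignores arbitration limits.

```python
CHANNELS = 4                    # unidirectional rings
TRANSACTIONS_PER_CHANNEL = 3    # concurrent transactions when traffic permits
BYTES_PER_TRANSFER = 16         # channel width
SYSTEM_CLOCKS_PER_TRANSFER = 2  # the EIB runs at half the system clock


def eib_peak_bytes_per_clock():
    """Peak instantaneous EIB bandwidth in bytes per system clock."""
    concurrent = CHANNELS * TRANSACTIONS_PER_CHANNEL
    return concurrent * BYTES_PER_TRANSFER // SYSTEM_CLOCKS_PER_TRANSFER


def channel_mb_per_s(clock_mhz):
    """Flow rate of a single transaction stream on one channel, in MB/s."""
    return BYTES_PER_TRANSFER * clock_mhz // SYSTEM_CLOCKS_PER_TRANSFER


def eib_peak_mb_per_s(clock_mhz):
    """Peak instantaneous EIB bandwidth scaled by the system clock, in MB/s."""
    return eib_peak_bytes_per_clock() * clock_mhz


print(eib_peak_bytes_per_clock())      # prints 96
print(channel_mb_per_s(3200) / 1000)   # prints 25.6
print(eib_peak_mb_per_s(3200) / 1000)  # prints 307.2
```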
However, other technical restrictions are involved in the arbitration mechanism for packets accepted onto the bus. The IBM Systems Performance group explained:
This quote apparently represents the full extent of IBM's public disclosure of this mechanism and its impact. The EIB arbitration unit, the snooping mechanism, and interrupt generation on segment or page translation faults are not well described in the documentation set as yet made public by IBM.
In practice, effective EIB bandwidth can also be limited by the ring participants involved. While each of the nine processing cores can sustain 25.6 GB/s read and write concurrently, the memory interface controller (MIC) is tied to a pair of XDR memory channels permitting a maximum flow of 25.6 GB/s for reads and writes combined, and the two I/O controllers are documented as supporting a peak combined input speed of 25.6 GB/s and a peak combined output speed of 35 GB/s.
To add further to the confusion, some older publications cite EIB bandwidth assuming a 4 GHz system clock. This reference frame results in an instantaneous EIB bandwidth figure of 384 GB/s and an arbitration-limited bandwidth figure of 256 GB/s.
All things considered, the theoretical 204.8 GB/s figure most often cited is the best one to bear in mind. The ''IBM Systems Performance'' group has demonstrated SPU-centric data flows achieving 197 GB/s on a Cell processor running at 3.2 GHz, so this figure is a fair reflection of practice as well.
Memory and I/O controllers
Cell contains a dual channel
Rambus XIO macro which interfaces to Rambus
XDR memory. The memory interface controller (MIC) is separate from the XIO macro and is designed by IBM. The XIO-XDR link runs at 3.2 Gbit/s per pin. Two 32-bit channels can provide a theoretical maximum of 25.6 GB/s.
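The 25.6 GB/s figure follows from the pin rate and the channel widths given above. The Python sketch below assumes one data pin per bit of channel width and works in Mbit/s and MB/s to keep the arithmetic in integers; the function name and parameter names are illustrative.

```python
def xdr_peak_mb_per_s(channels=2, bits_per_channel=32, mbit_per_pin=3200):
    """Theoretical peak XDR bandwidth in MB/s, assuming one data pin per channel bit."""
    total_mbit_per_s = channels * bits_per_channel * mbit_per_pin
    return total_mbit_per_s // 8  # 8 bits per byte


print(xdr_peak_mb_per_s() / 1000)  # prints 25.6
```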
The I/O interface, also a Rambus design, is known as
FlexIO. The FlexIO interface is organized into 12 lanes, each a unidirectional, 8-bit-wide point-to-point path. Five lanes are inbound to Cell, while the remaining seven are outbound. This provides a theoretical peak bandwidth of 62.4 GB/s (36.4 GB/s outbound, 26 GB/s inbound) at 2.6 GHz. The FlexIO interface can be clocked independently, typically at 3.2 GHz. Four inbound and four outbound lanes support memory coherency.
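The quoted FlexIO totals are consistent with every lane carrying 5.2 GB/s at 2.6 GHz, that is, two bytes per lane per clock. That per-lane rate is inferred here from the totals (36.4 / 7 and 26 / 5) and is an assumption of this sketch rather than something stated in the text; the names are illustrative, and values are in MB/s to keep the arithmetic exact.

```python
CLOCK_MHZ = 2600              # 2.6 GHz
BYTES_PER_LANE_PER_CLOCK = 2  # assumption inferred from the quoted totals


def flexio_mb_per_s(lane_count, clock_mhz=CLOCK_MHZ):
    """Peak bandwidth in MB/s for a group of FlexIO lanes under the assumption above."""
    return lane_count * BYTES_PER_LANE_PER_CLOCK * clock_mhz


outbound = flexio_mb_per_s(7)
inbound = flexio_mb_per_s(5)

print(outbound / 1000, inbound / 1000, (outbound + inbound) / 1000)  # prints 36.4 26.0 62.4
```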
Applications
Video processing card
Some companies, such as Leadtek, have released PCI-E cards based upon the Cell to allow for "faster than real time" transcoding of H.264, MPEG-2 and MPEG-4 video.
Blade server
On August 29, 2007, IBM announced the BladeCenter QS21. Generating a measured 1.05 gigaFLOPS (billion floating-point operations per second) per watt, with peak performance of approximately 460 GFLOPS, it was one of the most power-efficient computing platforms at the time of its release. A single BladeCenter chassis can achieve 6.4 teraFLOPS, and a standard 42U rack over 25.8 teraFLOPS.
On May 13, 2008, IBM announced the BladeCenter QS22. The QS22 introduces the PowerXCell 8i processor with five times the double-precision floating point performance of the QS21, and the capacity for up to 32 GB of DDR2 memory on-blade.
IBM discontinued its blade server line based on Cell processors on January 12, 2012.
PCI Express board
Several companies provide PCI Express boards utilising the IBM PowerXCell 8i. Reported performance is 179.2 GFLOPS single precision (SP) and 89.6 GFLOPS double precision (DP) at 2.8 GHz.
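These peak figures are consistent with a simple count of SIMD lanes. The sketch below assumes eight SPEs with 128-bit registers (four single-precision or two double-precision lanes), counts a fused multiply-add as two floating-point operations per lane per cycle, and ignores any contribution from the PPE. The function name is hypothetical.

```python
def spe_peak_gflops(clock_ghz: float, lane_bits: int,
                    spes: int = 8, register_bits: int = 128,
                    flops_per_lane_per_cycle: int = 2) -> float:
    """Peak GFLOPS across all SPEs, counting a fused multiply-add as 2 flops."""
    lanes = register_bits // lane_bits  # 4 lanes for 32-bit, 2 lanes for 64-bit
    return spes * lanes * flops_per_lane_per_cycle * clock_ghz


if __name__ == "__main__":
    print(f"SP at 2.8 GHz: {spe_peak_gflops(2.8, 32):.1f} GFLOPS")
    print(f"DP at 2.8 GHz: {spe_peak_gflops(2.8, 64):.1f} GFLOPS")
    print(f"DP at 3.2 GHz: {spe_peak_gflops(3.2, 64):.1f} GFLOPS")
```

Under the same assumptions, a 3.2 GHz part gives about 102.4 GFLOPS in double precision.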
Console video games
Sony's PlayStation 3 video game console was the first production application of the Cell processor, clocked at 3.2 GHz and containing seven out of eight operational SPEs, to allow Sony to increase the yield on the processor manufacture. Only six of the seven SPEs are accessible to developers, as one is reserved by the OS.
Home cinema

Toshiba has produced HDTVs using Cell. They presented a system to decode 48 standard definition MPEG-2 streams simultaneously on a 1920×1080 screen. This can enable a viewer to choose a channel based on dozens of thumbnail videos displayed simultaneously on the screen.
Laptop PCs
Toshiba produced a laptop, the Qosmio G55, released in 2008, that contains embedded Cell technology. Its CPU is otherwise an Intel Core x86-based chip, as is common on Toshiba computers.
Supercomputing
IBM's supercomputer, IBM Roadrunner, was a hybrid of general-purpose x86-64 Opteron processors and Cell processors. This system assumed the #1 spot on the June 2008 Top 500 list as the first supercomputer to run at petaFLOPS speeds, having achieved a sustained 1.026 petaFLOPS using the standard LINPACK benchmark. IBM Roadrunner used the PowerXCell 8i version of the Cell processor, manufactured using 65 nm technology and with enhanced SPUs that can handle double-precision calculations in the 128-bit registers, reaching 102 GFLOPS in double precision per chip.
Cluster computing
Clusters of PlayStation 3 consoles are an attractive alternative to high-end systems based on Cell blades. Innovative Computing Laboratory, a group led by Jack Dongarra in the Computer Science Department at the University of Tennessee, investigated such an application in depth. Terrasoft Solutions is selling 8-node and 32-node PS3 clusters with Yellow Dog Linux pre-installed, an implementation of Dongarra's research.
As first reported by ''Wired'' on October 17, 2007, astrophysicist Gaurav Khanna, of the Physics department at the University of Massachusetts Dartmouth, replaced time used on supercomputers with a cluster of eight PlayStation 3s. The next generation of this machine, now called the ''PlayStation 3 Gravity Grid'', uses a network of 16 machines and exploits the Cell processor for its intended application, binary black hole coalescence using perturbation theory. In particular, the cluster performs astrophysical simulations of large supermassive black holes capturing smaller compact objects, and it has generated numerical data that has been published multiple times in the relevant scientific research literature. The Cell processor version used by the PlayStation 3 has a main CPU and six SPEs available to the user, giving the Gravity Grid machine a total of 16 general-purpose processors and 96 vector processors. The machine has a one-time cost of $9,000 to build and is adequate for black-hole simulations that would otherwise cost $6,000 per run on a conventional supercomputer. The black hole calculations are not memory-intensive and are highly localizable, and so are well suited to this architecture. Khanna claims that on his simulations the cluster's performance exceeds that of a traditional Linux cluster based on more than 100 Intel Xeon cores. The PS3 Gravity Grid gathered significant media attention from 2007 through 2010.
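The processor counts and the cost comparison for the Gravity Grid reduce to simple arithmetic. A small sketch using only the numbers quoted above; the break-even estimate ignores power, maintenance and staff costs, which the text does not give, and the names are hypothetical.

```python
import math

NODES = 16                              # PlayStation 3 consoles in the cluster
PPES_PER_NODE = 1                       # one main CPU per console
USABLE_SPES_PER_NODE = 6                # SPEs available to the user
BUILD_COST_USD = 9_000                  # one-time cost of the cluster
SUPERCOMPUTER_COST_PER_RUN_USD = 6_000  # quoted cost of one conventional run


def gravity_grid_summary(nodes: int = NODES) -> dict[str, int]:
    """Processor totals and the number of runs needed to recoup the build cost."""
    return {
        "general_purpose_processors": nodes * PPES_PER_NODE,
        "vector_processors": nodes * USABLE_SPES_PER_NODE,
        # Smallest whole number of runs whose supercomputer cost covers the build.
        "break_even_runs": math.ceil(BUILD_COST_USD / SUPERCOMPUTER_COST_PER_RUN_USD),
    }


if __name__ == "__main__":
    print(gravity_grid_summary())
```

On these figures the cluster pays for itself after the second simulation run.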
The Computational Biochemistry and Biophysics lab at the Universitat Pompeu Fabra, in Barcelona, deployed in 2007 a BOINC system called PS3GRID for collaborative computing based on the CellMD software, the first one designed specifically for the Cell processor.
The United States Air Force Research Laboratory has deployed a PlayStation 3 cluster of over 1700 units, nicknamed the "Condor Cluster", for analyzing high-resolution satellite imagery. The Air Force claims the Condor Cluster would be the 33rd largest supercomputer in the world in terms of capacity. The lab has opened up the supercomputer for use by universities for research.
Distributed computing
With the help of the computing power of over half a million PlayStation 3 consoles, the distributed computing project Folding@home has been recognized by ''Guinness World Records'' as the most powerful distributed network in the world. The first record was achieved on September 16, 2007, when the project surpassed one petaFLOPS, which had never previously been attained by a distributed computing network. Additionally, the collective efforts enabled the PS3 clients alone to reach the petaFLOPS mark on September 23, 2007. In comparison, the world's second-most powerful supercomputer at the time, IBM's Blue Gene/L, performed at around 478.2 teraFLOPS, which means Folding@home's computing power was approximately twice that of Blue Gene/L (although the CPU interconnect in Blue Gene/L is more than one million times faster than the mean network speed in Folding@home). As of May 7, 2011, Folding@home ran at about 9.3 x86 petaFLOPS, with 1.6 petaFLOPS generated by 26,000 active PS3s alone.
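Two ratios implied by these figures can be checked quickly: how one petaFLOPS compares with Blue Gene/L's throughput, and the average contribution per console in the 2011 snapshot. The sketch uses only the numbers quoted above; the function names are hypothetical.

```python
def throughput_ratio(network_tflops: float, supercomputer_tflops: float) -> float:
    """How many times larger the network's throughput is than the supercomputer's."""
    return network_tflops / supercomputer_tflops


def average_gflops_per_console(total_pflops: float, consoles: int) -> float:
    """Average contribution per console in GFLOPS (1 petaFLOPS = 1,000,000 GFLOPS)."""
    return total_pflops * 1_000_000 / consoles


if __name__ == "__main__":
    # One petaFLOPS (1000 teraFLOPS) against Blue Gene/L at 478.2 teraFLOPS.
    print(f"ratio: {throughput_ratio(1000.0, 478.2):.2f}x")
    # 1.6 petaFLOPS from 26,000 active PS3s in May 2011.
    print(f"per console: {average_gflops_per_console(1.6, 26_000):.1f} GFLOPS")
```

The first ratio comes to roughly 2.1, in line with the "approximately twice" comparison.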
Mainframes
IBM announced on April 25, 2007, that it would begin integrating its Cell Broadband Engine Architecture microprocessors into the company's System z line of mainframes. This has led to a gameframe.
Password cracking
The architecture of the processor makes it better suited to hardware-assisted cryptographic brute-force attack applications than conventional processors.
Software engineering
Due to the flexible nature of the Cell, there are several possibilities for the utilization of its resources, not limited to just different computing paradigms:
Job queue
The PPE maintains a job queue, schedules jobs in SPEs, and monitors progress. Each SPE runs a "mini kernel" whose role is to fetch a job, execute it, and synchronize with the PPE.
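The job-queue model can be illustrated with ordinary threads standing in for the processing elements. This is only an analogy, assuming Python's standard `queue` and `threading` modules in place of the Cell's mailboxes and DMA transfers: the main thread plays the PPE and each worker thread plays an SPE running a "mini kernel". All names are hypothetical.

```python
import queue
import threading


def spe_mini_kernel(spe_id, jobs, results):
    """Stand-in for an SPE mini kernel: fetch a job, execute it, report back."""
    while True:
        job = jobs.get()
        if job is None:  # sentinel from the "PPE": no more work
            break
        job_id, func, arg = job
        results.put((job_id, spe_id, func(arg)))  # synchronize with the "PPE"


def run_job_queue(work_items, func, num_spes=6):
    """Stand-in for the PPE: queue the jobs, start the workers, collect results."""
    work_items = list(work_items)
    jobs, results = queue.Queue(), queue.Queue()
    workers = [threading.Thread(target=spe_mini_kernel, args=(i, jobs, results))
               for i in range(num_spes)]
    for worker in workers:
        worker.start()
    for job_id, arg in enumerate(work_items):
        jobs.put((job_id, func, arg))
    for _ in workers:
        jobs.put(None)  # one stop sentinel per worker
    # Monitor progress: wait for exactly one result per submitted job.
    finished = {}
    for _ in work_items:
        job_id, _spe_id, value = results.get()
        finished[job_id] = value
    for worker in workers:
        worker.join()
    return [finished[i] for i in range(len(work_items))]


if __name__ == "__main__":
    print(run_job_queue(range(10), lambda x: x * x))
```

The central queue keeps scheduling decisions in one place, at the cost of making the "PPE" a potential bottleneck when jobs are very short.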
Self-multitasking of SPEs
The mini kernel and scheduling are distributed across the SPEs. Tasks are synchronized using mutexes or semaphores, as in a conventional operating system. Ready-to-run tasks wait in a queue for an SPE to execute them. The SPEs use shared memory for all tasks in this configuration.
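In the self-multitasking model there is no central scheduler: every worker pulls from a shared ready queue and guards it with a mutex. The sketch below is an analogy using Python threads and a `threading.Lock`, not Cell code; the class and function names are hypothetical.

```python
import threading
from collections import deque


class SharedTaskPool:
    """Ready queue and result list held in shared memory, guarded by one mutex."""

    def __init__(self, tasks):
        self._ready = deque(tasks)
        self._lock = threading.Lock()
        self.results = []

    def take(self):
        """Atomically remove the next ready task, or return None if none remain."""
        with self._lock:
            return self._ready.popleft() if self._ready else None

    def publish(self, value):
        """Atomically record a finished task's result."""
        with self._lock:
            self.results.append(value)


def spe_worker(pool):
    """Each worker schedules itself: take a task, run it, repeat until empty."""
    while (task := pool.take()) is not None:
        func, arg = task
        pool.publish(func(arg))


def run_self_scheduled(tasks, num_spes=4):
    """Run (func, arg) tasks on self-scheduling workers; result order may vary."""
    pool = SharedTaskPool(tasks)
    workers = [threading.Thread(target=spe_worker, args=(pool,))
               for _ in range(num_spes)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return pool.results


if __name__ == "__main__":
    print(sorted(run_self_scheduled([(abs, -3), (abs, 2), (abs, -1)])))
```

Because workers finish in no fixed order, callers that need ordered output should sort the results or tag each task with an index.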
Stream processing
Each SPE runs a distinct program. Data comes from an input stream and is sent to the SPEs. When an SPE has finished processing, the output data is sent to an output stream.
This provides a flexible and powerful architecture for stream processing, and allows explicit scheduling for each SPE separately. Other processors are also able to perform streaming tasks but are limited by the kernel loaded.
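One way to picture the streaming model is as a chain of stages, each running its own program and passing data along. The sketch below assumes a linear pipeline, which is only one possible arrangement and is not specified in the text. It uses Python threads and queues as stand-ins for SPEs and their data transfers, and all names are hypothetical.

```python
import queue
import threading

_END = object()  # marker for the end of the stream


def spe_stage(program, inbox, outbox):
    """One stage: apply this stage's program to each item and pass it on."""
    while True:
        item = inbox.get()
        if item is _END:
            outbox.put(_END)  # propagate end-of-stream to the next stage
            break
        outbox.put(program(item))


def run_pipeline(programs, input_stream):
    """Feed input_stream through one thread per program; return the output stream."""
    links = [queue.Queue() for _ in range(len(programs) + 1)]
    stages = [threading.Thread(target=spe_stage, args=(program, links[i], links[i + 1]))
              for i, program in enumerate(programs)]
    for stage in stages:
        stage.start()
    for item in input_stream:
        links[0].put(item)
    links[0].put(_END)
    output = []
    while (item := links[-1].get()) is not _END:
        output.append(item)
    for stage in stages:
        stage.join()
    return output


if __name__ == "__main__":
    # Two distinct "programs": add one, then double.
    print(run_pipeline([lambda x: x + 1, lambda x: x * 2], [1, 2, 3]))
```

Each stage is a single thread reading from a FIFO queue, so items leave the pipeline in the order they entered.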
Open source software development
In 2005, patches enabling Cell support in the Linux kernel were submitted for inclusion by IBM developers. Arnd Bergmann (one of the developers of the aforementioned patches) also described the Linux-based Cell architecture at LinuxTag 2005.
As of release 2.6.16 (March 20, 2006), the Linux kernel officially supports the Cell processor.
Both PPE and SPEs are programmable in C/C++ using a common API provided by libraries.
Fixstars Solutions provides Yellow Dog Linux for IBM and Mercury Cell-based systems, as well as for the PlayStation 3. Terra Soft strategically partnered with Mercury to provide a Linux Board Support Package for Cell, and support and development of software applications on various other Cell platforms, including the IBM BladeCenter JS21 and Cell QS20, and Mercury Cell-based solutions. Terra Soft also maintains the Y-HPC (High Performance Computing) Cluster Construction and Management Suite and Y-Bio gene sequencing tools. Y-Bio is built upon the RPM Linux standard for package management, and offers tools which help bioinformatics researchers conduct their work with greater efficiency. IBM has developed a pseudo-filesystem for Linux named "Spufs" that simplifies access to and use of the SPE resources. IBM is currently maintaining Linux kernel and GDB ports, while Sony maintains the GNU toolchain (GCC, binutils).
In November 2005, IBM released a "Cell Broadband Engine (CBE) Software Development Kit Version 1.0", consisting of a simulator and assorted tools, to its web site. Development versions of the latest kernel and tools for Fedora Core 4 are maintained at the Barcelona Supercomputing Center website.
In August 2007, Mercury Computer Systems released a Software Development Kit for PlayStation 3 for High-Performance Computing.
In November 2007, Fixstars Corporation released the new "CVCell" module aiming to accelerate several important OpenCV APIs for Cell. In a series of software calculation tests, they recorded execution times on a 3.2 GHz Cell processor that were between 6 and 27 times faster than the same software on a 2.4 GHz Intel Core 2 Duo.
In October 2009, IBM released an OpenCL driver for POWER6 and CBE. This allows programs written in the cross-platform API to be easily run on Cell PSE.
Gallery
Illustrations of the different generations of Cell/B.E. processors and the PowerXCell 8i. The images are not to scale; all Cell/B.E. packages measure 42.5×42.5 mm and the PowerXCell 8i measures 47.5×47.5 mm.
File:Cell-BE-90nm-lid.jpg, The 90 nm Cell/B.E. that shipped with the first PlayStation 3. The usual way one would see it is with its lid on, as it is glued on and not easily removed.
File:Cell-BE-90nm.jpg, The 90 nm Cell/B.E. that shipped with the first PlayStation 3. It has its lid removed to show the size of the processor die underneath.
File:Cell-BE-90-underside.jpg, The underside of the 90 nm Cell/B.E. processor showing its 1242 solder balls, each 0.6 mm in diameter, and its array of 35 capacitors
File:Cell-BE-65nm.jpg, The 65 nm Cell/B.E. that shipped with updated PlayStation 3s. It has its lid removed to show the size of the processor die underneath.
File:Cell-BE-45nm.jpg, The 45 nm Cell/B.E. that shipped with updated PlayStation 3s such as the Slim and Super Slim versions. It has its lid removed to show the size of the processor die underneath.
File:PowerXCell-8i.jpg, The 65 nm high-performance PowerXCell 8i with extra capacitors on top due to decoupling needed for noise introduced by the DDR2 interface
See also
* STI Center of Competence for the Cell Processor
* Adapteva Epiphany architecture, a similar network-on-a-chip with local stores and DMA, but more cores and easier off-core communication
* Vision Processing Unit, an emerging class of processor with some similar features
* Multiprocessor system on a chip
* Cell software development
* Xenon (processor)
* PowerPC
External links
* Cell Broadband Engine resource center
* Sony Computer Entertainment Incorporated's Cell resource page
* Cmpware Configurable Multiprocessor Development Kit for Cell BE
* ISSCC 2005: The CELL Microprocessor, a comprehensive overview of the CELL microarchitecture
* Introducing the IBM/Sony/Toshiba Cell Processor -- Part I: the SIMD processing units
* Introducing the IBM/Sony/Toshiba Cell Processor -- Part II: The Cell Architecture
* The Soul of Cell: An interview with Dr. H. Peter Hofstee