The QCDOC (quantum chromodynamics on a chip) is a supercomputer technology focusing on using relatively cheap, low-power processing elements to produce a massively parallel machine. The machine is custom-made to solve small but extremely demanding problems in the field of quantum physics.
Overview
The computers were designed and built jointly by the University of Edinburgh (UKQCD), Columbia University, the RIKEN BNL Brookhaven Research Center and IBM. The purpose of the collaboration was to exploit computing facilities for lattice field theory calculations whose primary aim is to increase the predictive power of the Standard Model of elementary particle interactions through numerical simulation of quantum chromodynamics (QCD). The target was to build a massively parallel supercomputer able to peak at 10 Tflops, with sustained performance at 50% of that peak.
There are three QCDOCs in service, each reaching 10 Tflops peak operation.
* University of Edinburgh's Parallel Computing Centre (EPCC). In operation by the UKQCD since 2005
* RIKEN BNL Brookhaven Research Center at Brookhaven National Laboratory
* U.S. Department of Energy Program in High Energy and Nuclear Physics at Brookhaven National Laboratory
Around 23 UK academic staff, together with their postdocs and students, from seven universities belong to UKQCD. Costs were funded through a Joint Infrastructure Fund Award of £6.6 million. Staff costs (system support, physicist programmers and postdocs) are around £1 million per year; other computing and operating costs are around £0.2 million per year.
QCDOC was to replace an earlier design, QCDSP, where the power came from connecting a large number of DSPs together in a similar fashion. The QCDSP strapped 12,288 nodes to a 4D network and reached 1 Tflops in 1998.
QCDOC can be seen as a predecessor to the highly successful Blue Gene/L supercomputer. They share many design traits, and the similarities go beyond superficial characteristics. Blue Gene is also a massively parallel supercomputer built with a large number of cheap, relatively weak PowerPC 440 based SoC nodes connected with a high-bandwidth multidimensional mesh. They differ, however, in that the computing nodes in BG/L are more powerful and are connected with a faster, more sophisticated network that scales up to several hundred thousand nodes per system.
Architecture
Computing node
The computing nodes are custom ASICs with about fifty million transistors each. They are mainly made up of existing building blocks from IBM. They are built around a 500 MHz PowerPC 440 core with 4 MB DRAM, memory management for external DDR SDRAM, system I/O for internode communications, and dual Ethernet built in. The computing node is capable of 1 double precision Gflops. Each node has one DIMM socket capable of holding between 128 and 2048 MB of 333 MHz ECC DDR SDRAM.
Inter node communication
Each node has the capability to send and receive data from each of its twelve nearest neighbors in a six-dimensional mesh at a rate of 500 Mbit/s each. Counting the send and receive directions separately gives 24 channels and a total off-node bandwidth of 12 Gbit/s. Each of these 24 channels has DMA to the other nodes' on-chip DRAM or the external SDRAM. In practice, only four dimensions are used to form a communications sub-torus, while the remaining two dimensions are used to partition the system.
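The neighbour count and bandwidth figure in this section can be checked with a short sketch. It assumes a six-dimensional mesh with periodic wraparound in every dimension; the mesh extents and the function names are hypothetical and chosen only for illustration, not taken from the QCDOC software.

```python
from typing import List, Tuple

LINK_RATE_MBIT = 500  # per-channel rate quoted in the text
Coord = Tuple[int, ...]


def neighbours(coord: Coord, extents: Coord) -> List[Coord]:
    """Return the nearest neighbours of a node: one step forward and one
    step back along each dimension (wraparound is an assumption)."""
    result = []
    for axis, size in enumerate(extents):
        for step in (+1, -1):
            shifted = list(coord)
            shifted[axis] = (coord[axis] + step) % size
            result.append(tuple(shifted))
    return result


def off_node_bandwidth_gbit(n_neighbours: int) -> float:
    """Total off-node bandwidth, counting one send and one receive
    channel per neighbour."""
    channels = 2 * n_neighbours
    return channels * LINK_RATE_MBIT / 1000


# Hypothetical extents for the six dimensions (not a real QCDOC layout).
extents = (4, 4, 4, 4, 4, 4)
node = (0, 0, 0, 0, 0, 0)
nbrs = neighbours(node, extents)

print(len(nbrs))                           # number of nearest neighbours
print(off_node_bandwidth_gbit(len(nbrs)))  # total bandwidth in Gbit/s
```

Two neighbours per dimension in six dimensions gives the twelve neighbours mentioned in the text, and 24 channels at 500 Mbit/s give 12 Gbit/s.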
The operating system communicates with the computing nodes using the Ethernet network. This is also used for diagnostics, configuration and communications with disk storage.
Mechanical design
Two nodes are placed together on a daughter card with one DIMM socket and a 4:1 Ethernet hub for off-card communications. The daughter cards have two connectors, one carrying the internode communications network and one carrying power, Ethernet, clock and other housekeeping facilities.
Thirty-two daughter cards are placed in two rows on a motherboard that supports 800 Mbit/s off-board Ethernet communications. Eight motherboards are placed in each crate, with two backplanes supporting four motherboards each. Each crate consists of 512 processor nodes and a 2^6 hypercube communications network. One node consumes about 5 W of power, and each crate is air and water cooled. A complete system can consist of any number of crates, for a total of up to several tens of thousands of nodes.
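The packaging figures above can be cross-checked with a little arithmetic. The sketch below uses only numbers quoted in this article (two nodes per daughter card, 32 daughter cards per motherboard, eight motherboards per crate, about 5 W and 1 Gflops peak per node). The power estimate is an assumption that covers the node ASICs only and ignores memory, networking and cooling overhead; `crates_for_peak` is a hypothetical helper.

```python
import math

NODES_PER_DAUGHTER_CARD = 2
DAUGHTER_CARDS_PER_MOTHERBOARD = 32
MOTHERBOARDS_PER_CRATE = 8
WATTS_PER_NODE = 5        # "about 5 W" per node, from the text
PEAK_GFLOPS_PER_NODE = 1  # double precision peak per node, from the text

nodes_per_motherboard = NODES_PER_DAUGHTER_CARD * DAUGHTER_CARDS_PER_MOTHERBOARD
nodes_per_crate = nodes_per_motherboard * MOTHERBOARDS_PER_CRATE

# Rough node-only power draw of one crate (assumption: no overheads).
crate_power_watts = nodes_per_crate * WATTS_PER_NODE


def crates_for_peak(target_tflops: float) -> int:
    """Smallest number of whole crates whose summed peak reaches the target."""
    nodes_needed = target_tflops * 1000 / PEAK_GFLOPS_PER_NODE
    return math.ceil(nodes_needed / nodes_per_crate)


print(nodes_per_motherboard)  # 64, i.e. 2**6 nodes per motherboard
print(nodes_per_crate)        # 512, matching the figure in the text
print(crate_power_watts)      # 2560 W for the nodes alone
print(crates_for_peak(10))    # crates needed for a 10 Tflops peak
```

Under these assumptions a 10 Tflops peak needs roughly 10,000 nodes, which rounds up to 20 crates.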
Operating system
The QCDOC runs a custom-built operating system, QOS, which facilitates boot, runtime, monitoring, diagnostics, and performance and simplifies management of the large number of computing nodes. It uses a custom embedded kernel and provides single process POSIX ("unix-like") compatibility using the Cygnus newlib library. The kernel includes a specially written UDP/IP stack and NFS client for disk access.
The operating system also maintains system partitions so that several users can have access to separate parts of the system for different applications. Each partition runs only one client application at any given time. Any multitasking is scheduled by the host controller system, which is a regular computer with a large number of Ethernet ports connecting to the QCDOC.
See also
* Norman Christ
* PowerPC 440
* Blue Gene/L
* QPACE
* Supercomputer