Tesla Dojo is a

supercomputer A supercomputer is a type of computer with a high level of performance as compared to a general-purpose computer. The performance of a supercomputer is commonly measured in floating-point operations per second (FLOPS) instead of million instruc ...

designed and built by Tesla for

computer vision Computer vision tasks include methods for image sensor, acquiring, Image processing, processing, Image analysis, analyzing, and understanding digital images, and extraction of high-dimensional data from the real world in order to produce numerical ...

video processing and recognition. It is used for training Tesla's

machine learning Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of Computational statistics, statistical algorithms that can learn from data and generalise to unseen data, and thus perform Task ( ...

models to improve its Full Self-Driving (FSD)

advanced driver-assistance system Advanced driver-assistance systems (ADAS) are technologies that assist drivers with the safe operation of a vehicle. Through a human-machine interface, ADAS increases car and road safety. ADAS uses automated technology, such as sensors and came ...

. According to Tesla, it went into production in July 2023. Dojo's goal is to efficiently process millions of

terabytes The byte is a unit of digital information that most commonly consists of eight bits. Historically, the byte was the number of bits used to encode a single character of text in a computer and for this reason it is the smallest addressable un ...

of video data captured from real-life driving situations from Tesla's 4+ million cars. This goal led to a considerably different architecture than conventional supercomputer designs.

History

Tesla operates several massively

parallel computing Parallel computing is a type of computing, computation in which many calculations or Process (computing), processes are carried out simultaneously. Large problems can often be divided into smaller ones, which can then be solved at the same time. ...

clusters for developing its

Autopilot An autopilot is a system used to control the path of a vehicle without requiring constant manual control by a human operator. Autopilots do not replace human operators. Instead, the autopilot assists the operator's control of the vehicle, allow ...

advanced driver assistance system. Its primary unnamed cluster using 5,760

Nvidia Nvidia Corporation ( ) is an American multinational corporation and technology company headquartered in Santa Clara, California, and incorporated in Delaware. Founded in 1993 by Jensen Huang (president and CEO), Chris Malachowsky, and Curti ...

A100

graphics processing units A graphics processing unit (GPU) is a specialized electronic circuit designed for digital image processing and to accelerate computer graphics, being present either as a discrete video card or embedded on motherboards, mobile phones, personal co ...

(GPUs) was touted by

Andrej Karpathy Andrej Karpathy (born 23 October 1986) is a Slovak-Canadian computer scientist who served as the director of artificial intelligence and Autopilot Vision at Tesla. He co-founded and formerly worked at OpenAI, where he specialized in deep lear ...

in 2021 at the fourth

International Joint Conference on Computer Vision and Pattern Recognition International is an adjective (also used as a noun) meaning "between nations". International may also refer to: Music Albums * ''International'' (Kevin Michael album), 2011 * ''International'' (New Order album), 2002 * ''International'' (The T ...

(CCVPR 2021) to be "roughly the number five supercomputer in the world" at approximately 81.6

petaflops Floating point operations per second (FLOPS, flops or flop/s) is a measure of computer performance in computing, useful in fields of scientific computations that require floating-point calculations. For such cases, it is a more accurate measu ...

, based on scaling the performance of the Nvidia Selene supercomputer, which uses similar components. However, the performance of the primary Tesla GPU cluster has been disputed, as it was not clear if this was measured using single-precision or double-precision floating point numbers (

FP32 Single-precision floating-point format (sometimes called FP32 or float32) is a computer number format, usually occupying 32 bits in computer memory; it represents a wide dynamic range of numeric values by using a floating radix point. A floati ...

FP64 Double-precision floating-point format (sometimes called FP64 or float64) is a floating-point number format, usually occupying 64 bits in computer memory; it represents a wide range of numeric values by using a floating radix point. Double prec ...

). Tesla also operates a second 4,032 GPU cluster for training and a third 1,752 GPU cluster for automatic labeling of objects. The primary unnamed Tesla GPU cluster has been used for processing one million video clips, each ten seconds long, taken from Tesla Autopilot cameras operating in Tesla cars in the real world, running at 36

frames per second A frame is often a structural system that supports other components of a physical construction and/or steel frame that limits the construction's extent. Frame and FRAME may also refer to: Physical objects In building construction *Framing (co ...

. Collectively, these video clips contained six billion object labels, with depth and velocity data; the total size of the data set was 1.5

petabytes The byte is a unit of digital information that most commonly consists of eight bits. Historically, the byte was the number of bits used to encode a single character of text in a computer and for this reason it is the smallest addressable un ...

. This data set was used for training a

neural network A neural network is a group of interconnected units called neurons that send signals to one another. Neurons can be either biological cells or signal pathways. While individual neurons are simple, many of them together in a network can perfor ...

intended to help Autopilot computers in Tesla cars understand roads. By August 2022, Tesla had upgraded the primary GPU cluster to 7,360 GPUs. Dojo was first mentioned by

Elon Musk Elon Reeve Musk ( ; born June 28, 1971) is a businessman. He is known for his leadership of Tesla, SpaceX, X (formerly Twitter), and the Department of Government Efficiency (DOGE). Musk has been considered the wealthiest person in th ...

in April 2019 during Tesla's "Autonomy Investor Day". In August 2020, Musk stated it was "about a year away" due to power and thermal issues. Dojo was officially announced at Tesla's Artificial Intelligence (AI) Day on August 19, 2021. Tesla revealed details of the D1 chip and its plans for "Project Dojo", a datacenter that would house 3,000 D1 chips; the first "Training Tile" had been completed and delivered the week before. In October 2021, Tesla released a "Dojo Technology"

whitepaper A white paper is a report or guide that informs readers concisely about a complex issue and presents the issuing body's philosophy on the matter. It is meant to help readers understand an issue, solve a problem, or make a decision. Since the 199 ...

describing the Configurable

Float8 Double-precision floating-point format (sometimes called FP64 or float64) is a floating-point number format, usually occupying 64 bits in computer memory; it represents a wide range of numeric values by using a floating radix point. Double prec ...

(CFloat8) and Configurable Float16 (CFloat16) floating point formats and arithmetic operations as an extension of

Institute of Electrical and Electronics Engineers The Institute of Electrical and Electronics Engineers (IEEE) is an American 501(c)(3) public charity professional organization for electrical engineering, electronics engineering, and other related disciplines. The IEEE has a corporate office ...

(IEEE) standard

754 __NOTOC__ Year 754 ( DCCLIV) was a common year starting on Tuesday of the Julian calendar, the 754th year of the Common Era (CE) and Anno Domini (AD) designations, the 754th year of the 1st millennium, the 54th year of the 8th century, and the ...

. At the follow-up AI Day in September 2022, Tesla announced it had built several System Trays and one Cabinet. During a test, the company stated that Project Dojo drew 2.3

megawatts The watt (symbol: W) is the unit of power or radiant flux in the International System of Units (SI), equal to 1 joule per second or 1 kg⋅m2⋅s−3. It is used to quantify the rate of energy transfer. The watt is named in honor o ...

(MW) of power before tripping a local San Jose, California power substation. At the time, Tesla was assembling one Training Tile per day. In August 2023, Tesla powered on Dojo for production use as well as a new training cluster configured with 10,000 Nvidia H100 GPUs. In January 2024, Musk described Dojo as "a long shot worth taking because the payoff is potentially very high. But it's not something that is a high probability." In June 2024, Musk explained that ongoing construction work at

Gigafactory Texas Gigafactory Texas (also known as Giga Texas, Giga Austin, or Gigafactory 5) is a Tesla, Inc. automotive manufacturing facility in unincorporated Travis County, Texas, just outside of Austin. Construction began in July 2020, limited production o ...

is for a computing cluster claiming that it is planned to comprise an even mix of "Tesla AI" and Nvidia/other hardware with a total thermal design power of at first 130 MW and eventually exceeding 500 MW.

Reception

Various analysts have stated Dojo "is impressive, but it won't transform supercomputing", "is a game-changer because it has been developed completely in-house", "will massively accelerate the development of autonomous vehicles", and "could be a game changer for the future of Tesla FSD and for AI more broadly." On September 11, 2023,

Morgan Stanley Morgan Stanley is an American multinational investment bank and financial services company headquartered at 1585 Broadway in Midtown Manhattan, New York City. With offices in 42 countries and more than 80,000 employees, the firm's clients in ...

increased its target price for Tesla stock ( TSLA) to US$400 from a prior target of $250 and called the stock its top pick in the electric vehicle sector, stating that Tesla’s Dojo supercomputer could fuel a $500 billion jump in Tesla’s market value.

Technical architecture

The fundamental unit of the Dojo supercomputer is the D1 chip, designed by a team at Tesla led by ex-

AMD Advanced Micro Devices, Inc. (AMD) is an American multinational corporation and technology company headquartered in Santa Clara, California and maintains significant operations in Austin, Texas. AMD is a hardware and fabless company that de ...

CPU A central processing unit (CPU), also called a central processor, main processor, or just processor, is the primary processor in a given computer. Its electronic circuitry executes instructions of a computer program, such as arithmetic, log ...

designer

Ganesh Venkataramanan Ganesha or Ganesh (, , ), also known as Ganapati, Vinayaka and Pillaiyar, is one of the best-known and most worshipped deities in the Hindu pantheon and is the Supreme God in the Ganapatya sect. His depictions are found throughout India. Hind ...

, including Emil Talpes, Debjit Das Sarma, Douglas Williams, Bill Chang, and Rajiv Kurian. The D1 chip is manufactured by the

Taiwan Semiconductor Manufacturing Company Taiwan Semiconductor Manufacturing Company Limited (TSMC or Taiwan Semiconductor) is a Taiwanese multinational semiconductor contract manufacturing and design company. It is one of the world's most valuable semiconductor companies, the world' ...

(TSMC) using 7

nanometer 330px, Different lengths as in respect to the Molecule">molecular scale. The nanometre (international spelling as used by the International Bureau of Weights and Measures; SI symbol: nm), or nanometer (American spelling Despite the va ...

(nm) semiconductor nodes, has 50 billion

transistors A transistor is a semiconductor device used to Electronic amplifier, amplify or electronic switch, switch electrical signals and electric power, power. It is one of the basic building blocks of modern electronics. It is composed of semicondu ...

and a large die size of . Updating at Artificial Intelligence (AI) Day in 2022, Tesla announced that Dojo would scale by deploying multiple ExaPODs, in which there would be: * 10 Cabinets per ''ExaPOD'' ( cores, D1 chips) * 2 System Trays per ''Cabinet'' ( cores, D1 chips) * 6 Training Tiles per ''System Tray'' ( cores, along with host interface hardware) * 25 D1 chips per ''Training Tile'' ( cores) * 354 computing cores per ''D1'' chip Tesla Dojo architecture

According to Venkataramanan, Tesla's senior director of Autopilot hardware, Dojo will have more than an

exaflop Floating point operations per second (FLOPS, flops or flop/s) is a measure of computer performance in computing, useful in fields of scientific computations that require floating-point calculations. For such cases, it is a more accurate measur ...

(a million teraflops) of computing power. For comparison, according to Nvidia, in August 2021, the (pre-Dojo) Tesla AI-training center used 720 nodes, each with eight Nvidia A100 Tensor Core GPUs for 5,760 GPUs in total, providing up to 1.8 exaflops of performance.

D1 chip

Each node (computing core) of the D1 processing chip is a general purpose

64-bit In computer architecture, 64-bit integers, memory addresses, or other data units are those that are 64 bits wide. Also, 64-bit central processing units (CPU) and arithmetic logic units (ALU) are those that are based on processor registers, a ...

CPU with a

superscalar A superscalar processor (or multiple-issue processor) is a CPU that implements a form of parallelism called instruction-level parallelism within a single processor. In contrast to a scalar processor, which can execute at most one single in ...

core. It supports internal instruction-level parallelism, and includes

simultaneous multithreading Simultaneous multithreading (SMT) is a technique for improving the overall efficiency of superscalar CPUs with hardware multithreading. SMT permits multiple independent threads of execution to better use the resources provided by modern proces ...

(SMT). It doesn't support

virtual memory In computing, virtual memory, or virtual storage, is a memory management technique that provides an "idealized abstraction of the storage resources that are actually available on a given machine" which "creates the illusion to users of a ver ...

and uses limited memory protection mechanisms. Dojo software/applications manage chip resources. Dojo node uarch

The D1 instruction set supports both 64-bit scalar and 64-byte

single instruction, multiple data Single instruction, multiple data (SIMD) is a type of parallel computer, parallel processing in Flynn's taxonomy. SIMD describes computers with multiple processing elements that perform the same operation on multiple data points simultaneousl ...

(SIMD) vector instructions. The integer unit mixes

reduced instruction set computer In electronics and computer science, a reduced instruction set computer (RISC) is a computer architecture designed to simplify the individual instructions given to the computer to accomplish tasks. Compared to the instructions given to a com ...

(

RISC-V RISC-V (pronounced "risk-five") is an open standard instruction set architecture (ISA) based on established reduced instruction set computer (RISC) principles. The project commenced in 2010 at the University of California, Berkeley. It transfer ...

) and custom instructions, supporting 8, 16, 32, or 64 bit integers. The custom vector math unit is optimized for machine learning kernels and supports multiple data formats, with a mix of precisions and numerical ranges, many of which are compiler composable. Up to 16 vector formats can be used simultaneously.

Node

Each D1 node uses a 32-byte fetch window holding up to eight instructions. These instructions are fed to an eight-wide decoder which supports two threads per cycle, followed by a four-wide, four-way SMT scalar scheduler that has two integer units, two address units, and one register file per thread. Vector instructions are passed further down the pipeline to a dedicated vector scheduler with two-way SMT, which feeds either a 64-byte SIMD unit or four 8×8×4 matrix multiplication units. The network on-chip (NOC) router links cores into a two-dimensional mesh network. It can send one packet in and one packet out in all four directions to/from each neighbor node, along with one 64-byte read and one 64-byte write to local SRAM per clock cycle. Hardware native operations transfer data, semaphores and barrier constraints across memories and CPUs. System-wide double data rate 4 (DDR4)

synchronous dynamic random-access memory Synchronous dynamic random-access memory (synchronous dynamic RAM or SDRAM) is any DRAM where the operation of its external pin interface is coordinated by an externally supplied clock signal. DRAM integrated circuits (ICs) produced from the ...

(SDRAM) memory works like bulk storage.

Memory

Each core has a 1.25

megabytes The megabyte is a multiple of the unit byte for digital information. Its recommended unit symbol is MB. The unit prefix ''mega'' is a multiplier of (106) in the International System of Units (SI). Therefore, one megabyte is one million bytes ...

(MB) of SRAM main memory. Load and store speeds reach 400

gigabytes The gigabyte () is a multiple of the unit byte for digital information. The prefix ''giga'' means 109 in the International System of Units (SI). Therefore, one gigabyte is one billion bytes. The unit symbol for the gigabyte is GB. This defin ...

(GB) per second and 270 GB/sec, respectively. The chip has explicit core-to-core data transfer instructions. Each SRAM has a unique list parser that feeds a pair of decoders and a gather engine that feeds the vector register file, which together can directly transfer information across nodes.

Die

Twelve nodes (cores) are grouped into a local block. Nodes are arranged in an 18×20 array on a single die, of which 354 cores are available for applications. The die runs at 2

gigahertz The hertz (symbol: Hz) is the unit of frequency in the International System of Units (SI), often described as being equivalent to one event (or cycle) per second. The hertz is an SI derived unit whose formal expression in terms of SI base un ...

(GHz) and totals 440 MB of SRAM (360 cores × 1.25 MB/core). It reaches 376 teraflops using 16-bit brain floating point (

BF16 The bfloat16 (brain floating point) floating-point format is a computer number format occupying 16 bits in computer memory; it represents a wide dynamic range of numeric values by using a floating radix point. This format is a shortened (16-bi ...

) numbers or using configurable 8-bit floating point (CFloat8) numbers, which is a Tesla proposal, and 22 teraflops at FP32. Each die comprises 576 bi-directional

serializer/deserializer A Serializer/Deserializer (SerDes) is a pair of functional blocks commonly used in high speed communications to compensate for limited input/output. These blocks convert data between serial data and parallel interfaces in each direction. The ter ...

(

SerDes A Serializer/Deserializer (SerDes) is a pair of functional blocks commonly used in high speed communications to compensate for limited input/output. These blocks convert data between serial data and parallel interfaces in each direction. The ter ...

) channels along the perimeter to link to other dies, and moves 8 TB/sec across all four die edges. Each D1 chip has a thermal design power of approximately 400 watts.

Training Tile

The water-cooled Training Tile packages 25 D1 chips into a 5×5 array. Each tile supports 36 TB/sec of aggregate bandwidth via 40

input/output In computing, input/output (I/O, i/o, or informally io or IO) is the communication between an information processing system, such as a computer, and the outside world, such as another computer system, peripherals, or a human operator. Inputs a ...

(I/O) chips - half the bandwidth of the chip mesh network. Each tile supports 10 TB/sec of on-tile bandwidth. Each tile has 11 GB of SRAM memory (25 D1 chips × 360 cores/D1 × 1.25 MB/core). Each tile achieves 9 petaflops at BF16/CFloat8 precision (25 D1 chips × 376 TFLOP/D1). Each tile consumes 15 kilowatts; 288

amperes The ampere ( , ; symbol: A), often shortened to amp,SI supports only the use of symbols and deprecates the use of abbreviations for units. is the unit of electric current in the International System of Units (SI). One ampere is equal to 1 c ...

at 52

volts The volt (symbol: V) is the unit of electric potential, electric potential difference (voltage), and electromotive force in the International System of Units (SI). Definition One volt is defined as the electric potential between two point ...

System Tray

Six tiles are aggregated into a System Tray, which is integrated with a

host interface A network interface controller (NIC, also known as a network interface card, network adapter, LAN adapter and physical network interface) is a computer hardware component that connects a computer to a computer network. Early network interface ...

. Each host interface includes 512

x86 x86 (also known as 80x86 or the 8086 family) is a family of complex instruction set computer (CISC) instruction set architectures initially developed by Intel, based on the 8086 microprocessor and its 8-bit-external-bus variant, the 8088. Th ...

cores, providing a

Linux Linux ( ) is a family of open source Unix-like operating systems based on the Linux kernel, an kernel (operating system), operating system kernel first released on September 17, 1991, by Linus Torvalds. Linux is typically package manager, pac ...

-based user environment. Previously, the Dojo System Tray was known as the Training Matrix, which includes six Training Tiles, 20 Dojo Interface Processor cards across four host servers, and Ethernet-linked adjunct servers. It has 53,100 D1 cores.

Dojo Interface Processor

Dojo Interface Processor cards (DIP) sit on the edges of the tile arrays and are hooked into the mesh network. Host systems power the DIPs and perform various system management functions. A DIP memory and I/O co-processor hold 32 GB of shared HBM (either HBM2e or

HBM3 High Bandwidth Memory (HBM) is a computer memory interface for 3D-stacked synchronous dynamic random-access memory (SDRAM) initially from Samsung, AMD and SK Hynix. It is used in conjunction with high-performance graphics accelerators, network de ...

) – as well as Ethernet interfaces that sidestep the mesh network. Each DIP card has 2 I/O processors with 4 memory banks totaling 32 GB with 800 GB/sec of

bandwidth Bandwidth commonly refers to: * Bandwidth (signal processing) or ''analog bandwidth'', ''frequency bandwidth'', or ''radio bandwidth'', a measure of the width of a frequency range * Bandwidth (computing), the rate of data transfer, bit rate or thr ...

. The DIP plugs into a

PCI-Express PCI Express (Peripheral Component Interconnect Express), officially abbreviated as PCIe, is a high-speed standard used to connect hardware components inside computers. It is designed to replace older expansion bus standards such as PCI, PC ...

4.0 x16 slot that offers 32 GB/sec of bandwidth per card. Five cards per tile edge offer 160 GB/sec of bandwidth to the host servers and 4.5 TB/sec to the tile.

Tesla Transport Protocol

Tesla Transport Protocol (TTP) is a proprietary interconnect over PCI-Express. A 50 GB/sec TTP protocol link runs over Ethernet to access either a single 400 Gb/sec port or a paired set of 200 Gb/sec ports. Crossing the entire two-dimensional mesh network might take 30 hops, while TTP over Ethernet takes only four hops (at lower bandwidth), reducing vertical latency.

Cabinet and ExaPOD

Dojo stacks tiles vertically in a cabinet to minimize the distance and communications time between them. The Dojo ExaPod system includes 120 tiles, totaling 1,062,000 usable cores, reaching 1 exaflops at BF16 and CFloat8 formats. It has 1.3 TB of on-tile SRAM memory and 13 TB of dual in-line

high bandwidth memory High Bandwidth Memory (HBM) is a computer memory interface for 3D-stacked synchronous dynamic random-access memory (SDRAM) initially from Samsung, AMD and SK Hynix. It is used in conjunction with high-performance graphics accelerators, network ...

(HBM).

Software

Dojo supports the framework

PyTorch PyTorch is a machine learning library based on the Torch library, used for applications such as computer vision and natural language processing, originally developed by Meta AI and now part of the Linux Foundation umbrella. It is one of the mo ...

, "Nothing as low level as C or C++, nothing remotely like

CUDA In computing, CUDA (Compute Unified Device Architecture) is a proprietary parallel computing platform and application programming interface (API) that allows software to use certain types of graphics processing units (GPUs) for accelerated gene ...

". The SRAM presents as a single address space. Because

has more precision and range than needed for AI tasks, and

FP16 In computing, half precision (sometimes called FP16 or float16) is a binary floating-point computer number format that occupies 16 bits (two bytes in modern computers) in computer memory. It is intended for storage of floating-point values in a ...

does not have enough, Tesla has devised 8- and 16-bit configurable floating point formats (CFloat8 and CFloat16, respectively) which allow the compiler to dynamically set mantissa and exponent precision, accepting lower precision in return for faster

vector processing In computing, a vector processor or array processor is a central processing unit (CPU) that implements an instruction set where its Instruction (computer science), instructions are designed to operate efficiently and effectively on large Array d ...

and reduced storage requirements.

References

External links

* * * {{Tesla, Inc.

Dojo A is a hall or place for immersive learning, experiential learning, or meditation. This is traditionally in the field of martial arts. The term literally means "place of the Tao, Way" in Japanese language, Japanese. History The word ''d� ...

Supercomputers Computer vision