In machine learning, the term ''tensor'' informally refers to two different concepts that organize and represent data. Data may be organized in a multidimensional array (''M''-way array) that is informally referred to as a "data tensor"; however, in the strict mathematical sense, a tensor is a multilinear mapping over a set of domain vector spaces to a range vector space. Observations, such as images, movies, volumes, sounds, and relationships among words and concepts, stored in an ''M''-way array ("data tensor"), may be analyzed either by artificial neural networks or by tensor methods.
Tensor decomposition can factorize data tensors into smaller tensors.
Operations on data tensors can be expressed in terms of matrix multiplication and the Kronecker product.
The computation of gradients, an important aspect of the backpropagation algorithm, can be performed using software libraries such as PyTorch and TensorFlow.
Computations are often performed on graphics processing units (GPUs) using CUDA, and on dedicated hardware such as Google's Tensor Processing Unit (TPU) or Nvidia's Tensor Core. These developments have greatly accelerated neural network architectures and increased the size and complexity of models that can be trained.
History
A tensor is by definition a multilinear map. In mathematics, it may express a multilinear relationship between sets of algebraic objects. In physics, tensor fields, considered as tensors at each point in space, are useful in expressing mechanics such as stress or elasticity.
. In machine learning, the exact use of tensors depends on the statistical approach being used.
By 2001, the fields of signal processing and statistics were making use of tensor methods. Pierre Comon surveys the early adoption of tensor methods in telecommunications, radio surveillance, chemometrics and sensor processing. Linear tensor rank methods (such as Parafac/CANDECOMP) analyzed ''M''-way arrays ("data tensors") composed of higher-order statistics, which were employed in blind source separation problems to compute a linear model of the data. He noted several early limitations in determining the tensor rank and performing efficient tensor rank decomposition.
In the early 2000s, multilinear tensor methods crossed over into computer vision, computer graphics and machine learning with papers by Vasilescu, alone or in collaboration with Terzopoulos, such as Human Motion Signatures, TensorFaces, TensorTextures and Multilinear Projection. Multilinear algebra, the algebra of higher-order tensors, is a suitable and transparent framework for analyzing the multifactor structure of an ensemble of observations and for addressing the difficult problem of disentangling causal factors based on second-order or higher-order statistics associated with each causal factor. Tensor (multilinear) factor analysis disentangles and reduces the influence of different causal factors with multilinear subspace learning. When treating an image or a video as a 2- or 3-way array, i.e., a "data matrix/tensor", tensor methods reduce spatial or time redundancies, as demonstrated by Wang and Ahuja.
Yoshua Bengio, Geoff Hinton and their collaborators briefly discuss the relationship between deep neural networks and tensor factor analysis beyond the use of ''M''-way arrays ("data tensors") as inputs. One of the early uses of tensors for neural networks appeared in natural language processing. A single word can be expressed as a vector via Word2vec. Thus a relationship between two words can be encoded in a matrix. However, for more complex relationships such as subject-object-verb, it is necessary to build higher-dimensional networks. In 2009, the work of Sutskever introduced Bayesian Clustered Tensor Factorization to model relational concepts while reducing the parameter space.
From 2014 to 2015, tensor methods became more common in convolutional neural networks (CNNs). Tensor methods organize neural network weights in a "data tensor" and analyze them to reduce the number of weights. Lebedev et al. accelerated CNNs for character classification (the recognition of letters and digits in images) by using 4D kernel tensors.
Definition
Let <math>F</math> be a field such as the real numbers <math>\mathbb{R}</math> or the complex numbers <math>\mathbb{C}</math>. A tensor <math>\mathcal{A}</math> is an array over <math>F</math>:
:<math>\mathcal{A} \in F^{I_1 \times I_2 \times \cdots \times I_M}.</math>
Here, <math>I_1, I_2, \ldots, I_M</math> and <math>M</math> are positive integers, and <math>M</math> is the number of dimensions, number of ''ways'', or ''mode'' of the tensor.
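As a minimal sketch (using NumPy purely for illustration), a mode-3 tensor over <math>\mathbb{R}</math> with <math>I_1 = 2</math>, <math>I_2 = 3</math>, <math>I_3 = 4</math> is simply a 3-way array:

<syntaxhighlight lang="python">
import numpy as np

# A mode-3 tensor over the reals: an element of R^(2 x 3 x 4).
A = np.zeros((2, 3, 4))

print(A.ndim)   # 3 -- the mode (number of ways) M
print(A.shape)  # (2, 3, 4) -- the dimensions I1, I2, I3
</syntaxhighlight>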
One basic approach (not the only way) to using tensors in machine learning is to embed various data types directly. For example, a grayscale image, commonly represented as a discrete 2D function <math>f(x, y)</math> with resolution <math>W \times H</math>, may be embedded in a mode-2 tensor as
:<math>\mathcal{A}_{i,j} = f(x_i, y_j).</math>
A color image with 3 channels for RGB might be embedded in a mode-3 tensor with three elements in an additional dimension:
:<math>\mathcal{A}_{i,j,k} = f_k(x_i, y_j), \qquad k \in \{\text{red}, \text{green}, \text{blue}\}.</math>
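For illustration, the following NumPy sketch (with arbitrary synthetic pixel values) embeds a grayscale and an RGB image as mode-2 and mode-3 arrays:

<syntaxhighlight lang="python">
import numpy as np

H, W = 4, 5  # illustrative resolution

# Grayscale image as a mode-2 tensor: A[i, j] = f(x_i, y_j).
gray = np.random.rand(H, W)

# RGB image as a mode-3 tensor with a channel axis of size 3:
# A[i, j, k] = f_k(x_i, y_j) for k in {red, green, blue}.
rgb = np.random.rand(H, W, 3)

print(gray.ndim, rgb.ndim)  # 2 3
</syntaxhighlight>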
In natural language processing, a word might be expressed as a vector <math>v</math> via the Word2vec algorithm. Thus <math>v</math> becomes a mode-1 tensor
:<math>v \mapsto \mathcal{A}_i.</math>
The embedding of subject-object-verb semantics requires embedding relationships among three words. Because a word is itself a vector, subject-object-verb semantics could be expressed using mode-3 tensors, for instance as the outer product of the three word vectors:
:<math>\mathcal{A}_{i,j,k} = \mathrm{subject}_i \, \mathrm{object}_j \, \mathrm{verb}_k.</math>
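As a sketch of this outer-product encoding (with made-up 3-dimensional vectors standing in for learned Word2vec embeddings):

<syntaxhighlight lang="python">
import numpy as np

# Hypothetical word vectors; real Word2vec embeddings would be learned.
subject = np.array([0.1, 0.5, 0.2])
obj     = np.array([0.3, 0.1, 0.4])
verb    = np.array([0.2, 0.2, 0.6])

# Mode-3 tensor A[i, j, k] = subject[i] * obj[j] * verb[k].
A = np.einsum('i,j,k->ijk', subject, obj, verb)
print(A.shape)  # (3, 3, 3)
</syntaxhighlight>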
In practice the neural network designer is primarily concerned with the specification of embeddings, the connection of tensor layers, and the operations performed on them in a network. Modern machine learning frameworks manage the optimization, tensor factorization and backpropagation automatically.
As unit values

Tensors may be used as the unit values of neural networks, extending the concept of scalar, vector and matrix values to multiple dimensions.
The output value of a single-layer unit <math>y_m</math> is the sum-product of its input units and the connection weights, filtered through the activation function <math>f</math>:
:<math>y_m = f\left(\sum_n x_n u_{m,n}\right),</math>
where
:<math>y_m, x_n, u_{m,n} \in \mathbb{R}.</math>

If each output element of <math>y_m</math> is a scalar, then we have the classical definition of an artificial neural network. By replacing each unit component with a tensor, the network is able to express higher-dimensional data such as images or videos:
:<math>y_m, x_n, u_{m,n} \in F^{I_1 \times I_2 \times \cdots \times I_M}.</math>
This use of tensors to replace unit values is common in convolutional neural networks where each unit might be an image processed through multiple layers. By embedding the data in tensors, such network structures enable learning of complex data types.
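As a sketch, assuming a single unit with three inputs and a ReLU activation: replacing scalar unit values with (here 2×2) tensor values leaves the sum-product structure unchanged:

<syntaxhighlight lang="python">
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

# Scalar units: three inputs and three weights.
x = np.array([1.0, -2.0, 0.5])
u = np.array([0.3, 0.8, -0.1])
y = relu(np.sum(x * u))  # classical scalar unit value

# Tensor units: each input and weight is now a 2x2 tensor.
X = np.random.rand(3, 2, 2)
U = np.random.rand(3, 2, 2)
Y = relu(np.sum(X * U, axis=0))  # elementwise sum-product over the units

print(np.shape(y), Y.shape)  # () (2, 2)
</syntaxhighlight>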
In fully connected layers
Tensors may also be used to compute the layers of a fully connected neural network, where the tensor is applied to the entire layer instead of individual unit values.
The output value of a single-layer unit <math>y_m</math> is the sum-product of its input units and the connection weights, filtered through the activation function <math>f</math>:
:<math>y_m = f\left(\sum_n x_n u_{m,n}\right).</math>
The vectors <math>x</math> and <math>y</math> of input and output values can be expressed as mode-1 tensors, while the hidden weights can be expressed as a mode-2 tensor. In this example the unit values are scalars while the tensors take on the dimensions of the network layers:
:<math>y_m \mapsto \mathcal{A}_m \in F^{M},</math>
:<math>x_n \mapsto \mathcal{A}_n \in F^{N},</math>
:<math>u_{m,n} \mapsto \mathcal{A}_{m,n} \in F^{M \times N}.</math>
In this notation, the output values can be computed as a tensor product of the input and weight tensors:
:<math>y = f(u\,x),</math>
which computes the sum-product as a tensor multiplication (similar to matrix multiplication).
This formulation of tensors enables the entire layer of a fully connected network to be efficiently computed by mapping the units and weights to tensors.
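A minimal NumPy sketch of this formulation (the layer sizes and the ReLU activation are illustrative choices):

<syntaxhighlight lang="python">
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

N, M = 4, 3                # input and output layer sizes
x = np.random.rand(N)      # mode-1 input tensor
U = np.random.rand(M, N)   # mode-2 weight tensor

# Entire layer at once: y[m] = f(sum_n U[m, n] * x[n]).
y = relu(U @ x)
print(y.shape)  # (3,)
</syntaxhighlight>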
In convolutional layers
A different reformulation of neural networks allows tensors to express the convolution layers of a neural network. A convolutional layer has multiple inputs, each of which is a spatial structure such as an image or volume. The inputs are convolved by filtering before being passed to the next layer. A typical use is to perform feature detection or isolation in image recognition.
Convolution is often computed as the multiplication of an input signal <math>g</math> with a filter kernel <math>f</math>. In two dimensions the discrete, finite form is:
:<math>(f * g)_{x,y} = \sum_{i=-w}^{w} \sum_{j=-w}^{w} f_{i,j} \, g_{x+i,\, y+j},</math>
where <math>w</math> is the width of the kernel.
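A direct NumPy sketch of this sum, computing only the fully overlapping ("valid") output region and shifting the indices to be 0-based:

<syntaxhighlight lang="python">
import numpy as np

def conv2d(f, g):
    """Direct 2D convolution of kernel f with signal g (valid region only)."""
    kh, kw = f.shape
    out = np.zeros((g.shape[0] - kh + 1, g.shape[1] - kw + 1))
    for x in range(out.shape[0]):
        for y in range(out.shape[1]):
            # Sum-product of the kernel with one patch of the signal.
            out[x, y] = np.sum(f * g[x:x + kh, y:y + kw])
    return out

g = np.random.rand(6, 6)   # input signal (e.g., an image patch)
f = np.random.rand(3, 3)   # filter kernel
print(conv2d(f, g).shape)  # (4, 4)
</syntaxhighlight>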
This definition can be rephrased as a matrix-vector product in terms of tensors that express the kernel, data and inverse transform of the kernel. In one dimension, a standard form of this kind (used in Winograd's fast convolution algorithms) is
:<math>f * g = A^{\top}\left[(G f) \odot (B^{\top} g)\right],</math>
where <math>A^{\top}</math>, <math>B^{\top}</math> and <math>G</math> are the inverse transform, the data transform and the kernel transform, respectively. The derivation is more complex when the filtering kernel also includes a non-linear activation function such as sigmoid or ReLU.
The hidden weights of the convolution layer are the parameters of the filter. These can be reduced with a pooling layer, which reduces the resolution (size) of the data and can also be expressed as a tensor operation.
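As an illustration, 2×2 max pooling over a mode-2 data tensor can be written as a reshape followed by a maximum over the pooled axes (a NumPy sketch; the input is assumed to have even dimensions):

<syntaxhighlight lang="python">
import numpy as np

x = np.random.rand(4, 6)  # layer data

# Group each 2x2 block onto its own axes, then reduce over them.
pooled = x.reshape(2, 2, 3, 2).max(axis=(1, 3))
print(pooled.shape)  # (2, 3)
</syntaxhighlight>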
Tensor factorization
An important contribution of tensors in machine learning is the ability to factorize tensors to decompose data into constituent factors or reduce the learned parameters. Data tensor modeling techniques stem from the linear tensor decomposition (CANDECOMP/Parafac decomposition) and the multilinear tensor decompositions (Tucker).
Tucker decomposition
Tucker decomposition, for example, takes a 3-way array <math>\mathcal{X} \in F^{I \times J \times K}</math> and decomposes the tensor into three matrices <math>A, B, C</math> and a smaller tensor <math>\mathcal{G}</math>. The shape of the matrices and the new tensor are such that the total number of elements is reduced. The new tensors have shapes
:<math>A \in F^{I \times P},</math>
:<math>B \in F^{J \times Q},</math>
:<math>C \in F^{K \times R},</math>
:<math>\mathcal{G} \in F^{P \times Q \times R}.</math>
Then the original tensor can be expressed as the tensor product of these four tensors:
:<math>\mathcal{X} = \mathcal{G} \times_1 A \times_2 B \times_3 C.</math>
In a concrete example, the dimensions of the tensors are
:<math>\mathcal{X}</math>: I=8, J=6, K=3,
:<math>A</math>: I=8, P=5,
:<math>B</math>: J=6, Q=4,
:<math>C</math>: K=3, R=2,
:<math>\mathcal{G}</math>: P=5, Q=4, R=2.
The total number of elements in the Tucker factorization is
:<math>|A| + |B| + |C| + |\mathcal{G}|</math>
:<math>= (I \times P) + (J \times Q) + (K \times R) + (P \times Q \times R) = 40 + 24 + 6 + 40 = 110.</math>
The number of elements in the original <math>\mathcal{X}</math> is 144, resulting in a data reduction from 144 down to 110 elements, a reduction of 23% in parameters or data size. For much larger initial tensors, and depending on the rank (redundancy) of the tensor, the gains can be more significant.
The work of Rabanser et al. provides an introduction to tensors with more details on the extension of Tucker decomposition to N-dimensions beyond the mode-3 example given here.
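A sketch of this mode-3 example using numpy.einsum to recombine the core tensor and factor matrices (random factors stand in for a fitted decomposition, which in practice would come from a library such as TensorLy):

<syntaxhighlight lang="python">
import numpy as np

I, J, K = 8, 6, 3
P, Q, R = 5, 4, 2

# Factor matrices and core tensor (random stand-ins for fitted factors).
A = np.random.rand(I, P)
B = np.random.rand(J, Q)
C = np.random.rand(K, R)
G = np.random.rand(P, Q, R)

# X[i,j,k] = sum over p,q,r of G[p,q,r] * A[i,p] * B[j,q] * C[k,r].
X = np.einsum('pqr,ip,jq,kr->ijk', G, A, B, C)

print(X.size)                             # 144 elements in the full tensor
print(A.size + B.size + C.size + G.size)  # 110 elements in the factors
</syntaxhighlight>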
Tensor trains
Another technique for decomposing tensors rewrites the initial tensor as a sequence (train) of smaller-sized tensors. A tensor-train (TT) is a sequence of tensors of reduced rank, called ''canonical factors''. The original tensor can be expressed as the sum-product of the sequence:
:<math>\mathcal{A}(i_1, i_2, \ldots, i_d) = G_1(i_1)\, G_2(i_2) \cdots G_d(i_d),</math>
where each <math>G_k(i_k)</math> is an <math>r_{k-1} \times r_k</math> matrix slice of the <math>k</math>-th canonical factor, with <math>r_0 = r_d = 1</math>.
Introducing the technique in 2011, Ivan Oseledets observed that Tucker decomposition is "suitable for small dimensions, especially for the three-dimensional case. For large ''d'' it is not suitable." Thus tensor-trains can be used to factorize larger tensors in higher dimensions.
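A sketch of evaluating one entry of a TT-format tensor (random cores stand in for factors obtained by an actual TT decomposition; the ranks are illustrative):

<syntaxhighlight lang="python">
import numpy as np

d = 4                # number of modes
n = [3, 4, 2, 5]     # mode sizes
r = [1, 2, 3, 2, 1]  # TT ranks, with r[0] = r[d] = 1

# Canonical factors G_k of shape (r[k-1], n[k], r[k]) (random stand-ins).
cores = [np.random.rand(r[k], n[k], r[k + 1]) for k in range(d)]

def tt_entry(cores, idx):
    """A(i_1, ..., i_d) as the product G_1(i_1) G_2(i_2) ... G_d(i_d)."""
    out = np.eye(1)
    for G, i in zip(cores, idx):
        out = out @ G[:, i, :]  # multiply by the r_{k-1} x r_k matrix slice
    return out.item()           # the final 1x1 product

print(tt_entry(cores, (1, 2, 0, 3)))
</syntaxhighlight>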
Tensor graphs
The unified data architecture and automatic differentiation of tensors have enabled higher-level designs of machine learning in the form of tensor graphs. This leads to new architectures, such as tensor-graph convolutional networks (TGCN), which identify highly non-linear associations in data, combine multiple relations, and scale gracefully, while remaining robust and performant.
These developments are impacting all areas of machine learning, such as text mining and clustering, time-varying data, and neural networks wherein the input data is a social graph and the data changes dynamically.
Hardware
Tensors provide a unified way to train neural networks for more complex data sets. However, training is expensive to compute on classical CPU hardware.
In 2014, Nvidia developed cuDNN (CUDA Deep Neural Network), a library of optimized primitives written in the parallel CUDA language. CUDA, and thus cuDNN, run on dedicated GPUs that implement unified massive parallelism in hardware. These GPUs were not yet dedicated chips for tensors, but rather existing hardware adapted for parallel computation in machine learning.
In the period 2015–2017, Google invented the Tensor Processing Unit (TPU). TPUs are dedicated, fixed-function hardware units that specialize in the matrix multiplications needed for tensor products. Specifically, they implement an array of 65,536 multiply units that can perform a 256×256 matrix sum-product in just one global instruction cycle.
Later in 2017, Nvidia released its own Tensor Core with the Volta GPU architecture. Each Tensor Core is a microunit that can perform a 4×4 matrix sum-product. There are eight tensor cores for each streaming multiprocessor (SM). The first GV100 GPU has 84 SMs, resulting in 672 tensor cores. This device accelerated machine learning by 12x over the previous Tesla GPUs. The number of tensor cores scales as the number of cores and SM units continues to grow in each new generation of cards.
The development of GPU hardware, combined with the unified architecture of tensor cores, has enabled the training of much larger neural networks. In 2022, the largest neural network was Google's PaLM with 540 billion learned parameters (network weights). For comparison, the older GPT-3 language model, which produces human-like text, has over 175 billion learned parameters; size is not everything, however, as Stanford's much smaller 2023 Alpaca model claims to be better, having been trained from the 7-billion-parameter variant of Meta's 2023 LLaMA model. The widely popular chatbot ChatGPT is built on top of GPT-3.5 (and, after an update, GPT-4) using supervised and reinforcement learning.