
t-distributed stochastic neighbor embedding (t-SNE) is a
statistical
Statistics (from German language, German: ', "description of a State (polity), state, a country") is the discipline that concerns the collection, organization, analysis, interpretation, and presentation of data. In applying statistics to a s ...
method for visualizing high-dimensional data by giving each datapoint a location in a two or three-dimensional map. It is based on Stochastic Neighbor Embedding originally developed by
Geoffrey Hinton and Sam Roweis,
where Laurens van der Maaten and Hinton proposed the
''t''-distributed variant.
It is a
nonlinear dimensionality reduction
Nonlinear dimensionality reduction, also known as manifold learning, is any of various related techniques that aim to project high-dimensional data, potentially existing across non-linear manifolds which cannot be adequately captured by linear de ...
technique for embedding high-dimensional data for visualization in a low-dimensional space of two or three dimensions. Specifically, it models each high-dimensional object by a two- or three-dimensional point in such a way that similar objects are modeled by nearby points and dissimilar objects are modeled by distant points with high probability.
The t-SNE algorithm comprises two main stages. First, t-SNE constructs a
probability distribution
In probability theory and statistics, a probability distribution is a Function (mathematics), function that gives the probabilities of occurrence of possible events for an Experiment (probability theory), experiment. It is a mathematical descri ...
over pairs of high-dimensional objects in such a way that similar objects are assigned a higher probability while dissimilar points are assigned a lower probability. Second, t-SNE defines a similar probability distribution over the points in the low-dimensional map, and it minimizes the
Kullback–Leibler divergence
In mathematical statistics, the Kullback–Leibler (KL) divergence (also called relative entropy and I-divergence), denoted D_\text(P \parallel Q), is a type of statistical distance: a measure of how much a model probability distribution is diff ...
(KL divergence) between the two distributions with respect to the locations of the points in the map. While the original algorithm uses the
Euclidean distance
In mathematics, the Euclidean distance between two points in Euclidean space is the length of the line segment between them. It can be calculated from the Cartesian coordinates of the points using the Pythagorean theorem, and therefore is o ...
between objects as the base of its similarity metric, this can be changed as appropriate. A
Riemannian variant is
UMAP.
t-SNE has been used for visualization in a wide range of applications, including
genomics
Genomics is an interdisciplinary field of molecular biology focusing on the structure, function, evolution, mapping, and editing of genomes. A genome is an organism's complete set of DNA, including all of its genes as well as its hierarchical, ...
,
computer security
Computer security (also cybersecurity, digital security, or information technology (IT) security) is a subdiscipline within the field of information security. It consists of the protection of computer software, systems and computer network, n ...
research,
natural language processing
Natural language processing (NLP) is a subfield of computer science and especially artificial intelligence. It is primarily concerned with providing computers with the ability to process data encoded in natural language and is thus closely related ...
,
music analysis,
cancer research
Cancer research is research into cancer to identify causes and develop strategies for prevention, diagnosis, treatment, and cure.
Cancer research ranges from epidemiology, molecular bioscience to the performance of clinical trials to evaluate ...
,
bioinformatics
Bioinformatics () is an interdisciplinary field of science that develops methods and Bioinformatics software, software tools for understanding biological data, especially when the data sets are large and complex. Bioinformatics uses biology, ...
, geological domain interpretation, and biomedical signal processing.
For a data set with ''n'' elements, t-SNE runs in time and requires space.
Details
Given a set of
high-dimensional objects
, t-SNE first computes probabilities
that are proportional to the similarity of objects
and
, as follows.
For
, define
:
and set
.
Note the above denominator ensures
for all
.
As van der Maaten and Hinton explained: "The similarity of datapoint
to datapoint
is the conditional probability,
, that
would pick
as its neighbor if neighbors were picked in proportion to their probability density under a Gaussian centered at
."
[
Now define
:
This is motivated because and from the N samples are estimated as 1/N, so the conditional probability can be written as and . Since , you can obtain previous formula.
Also note that and .
The bandwidth of the ]Gaussian kernel
In mathematics, a Gaussian function, often simply referred to as a Gaussian, is a function of the base form
f(x) = \exp (-x^2)
and with parametric extension
f(x) = a \exp\left( -\frac \right)
for arbitrary real constants , and non-zero . It is ...
s is set in such a way that the entropy
Entropy is a scientific concept, most commonly associated with states of disorder, randomness, or uncertainty. The term and the concept are used in diverse fields, from classical thermodynamics, where it was first recognized, to the micros ...
of the conditional distribution equals a predefined entropy using the bisection method
In mathematics, the bisection method is a root-finding method that applies to any continuous function for which one knows two values with opposite signs. The method consists of repeatedly bisecting the interval defined by these values and t ...
. As a result, the bandwidth is adapted to the density
Density (volumetric mass density or specific mass) is the ratio of a substance's mass to its volume. The symbol most often used for density is ''ρ'' (the lower case Greek letter rho), although the Latin letter ''D'' (or ''d'') can also be u ...
of the data: smaller values of are used in denser parts of the data space. The entropy increases with the perplexity of this distribution ; this relation is seen as
:
where is the Shannon entropy
The perplexity is a hand-chosen parameter of t-SNE, and as the authors state, "perplexity can be interpreted as a smooth measure of the effective number of neighbors. The performance of SNE is fairly robust to changes in the perplexity, and typical values are between 5 and 50.".[
Since the Gaussian kernel uses the ]Euclidean distance
In mathematics, the Euclidean distance between two points in Euclidean space is the length of the line segment between them. It can be calculated from the Cartesian coordinates of the points using the Pythagorean theorem, and therefore is o ...
, it is affected by the curse of dimensionality
The curse of dimensionality refers to various phenomena that arise when analyzing and organizing data in high-dimensional spaces that do not occur in low-dimensional settings such as the three-dimensional physical space of everyday experience. T ...
, and in high dimensional data when distances lose the ability to discriminate, the become too similar (asymptotically, they would converge to a constant). It has been proposed to adjust the distances with a power transform, based on the intrinsic dimension of each point, to alleviate this.
t-SNE aims to learn a -dimensional map (with and typically chosen as 2 or 3) that reflects the similarities as well as possible. To this end, it measures similarities between two points in the map and , using a very similar approach.
Specifically, for , define as
:
and set .
Herein a heavy-tailed Student t-distribution (with one-degree of freedom, which is the same as a Cauchy distribution
The Cauchy distribution, named after Augustin-Louis Cauchy, is a continuous probability distribution. It is also known, especially among physicists, as the Lorentz distribution (after Hendrik Lorentz), Cauchy–Lorentz distribution, Lorentz(ian) ...
) is used to measure similarities between low-dimensional points in order to allow dissimilar objects to be modeled far apart in the map.
The locations of the points in the map are determined by minimizing the (non-symmetric) Kullback–Leibler divergence
In mathematical statistics, the Kullback–Leibler (KL) divergence (also called relative entropy and I-divergence), denoted D_\text(P \parallel Q), is a type of statistical distance: a measure of how much a model probability distribution is diff ...
of the distribution from the distribution , that is:
:
The minimization of the Kullback–Leibler divergence with respect to the points is performed using gradient descent
Gradient descent is a method for unconstrained mathematical optimization. It is a first-order iterative algorithm for minimizing a differentiable multivariate function.
The idea is to take repeated steps in the opposite direction of the gradi ...
.
The result of this optimization is a map that reflects the similarities between the high-dimensional inputs.
Output
While t-SNE plots often seem to display clusters, the visual clusters can be strongly influenced by the chosen parameterization (especially the perplexity) and so a good understanding of the parameters for t-SNE is needed. Such "clusters" can be shown to even appear in structured data with no clear clustering, and so may be false findings. Similarly, the size of clusters produced by t-SNE is not informative, and neither is the distance between clusters. Thus, interactive exploration may be needed to choose parameters and validate results. It has been shown that t-SNE can often recover well-separated clusters, and with special parameter choices, approximates a simple form of spectral clustering
In multivariate statistics, spectral clustering techniques make use of the spectrum (eigenvalues) of the similarity matrix of the data to perform dimensionality reduction before clustering in fewer dimensions. The similarity matrix is provided ...
.
Software
* A C++ implementation of Barnes-Hut is available on th
github account
of one of the original authors.
* The R packag
Rtsne
implements t-SNE in R.
* ELKI contains tSNE, also with Barnes-Hut approximation
* scikit-learn
scikit-learn (formerly scikits.learn and also known as sklearn) is a free and open-source machine learning library for the Python programming language.
It features various classification, regression and clustering algorithms including support ...
, a popular machine learning library in Python implements t-SNE with both exact solutions and the Barnes-Hut approximation.
* Tensorboard, the visualization kit associated with TensorFlow
TensorFlow is a Library (computing), software library for machine learning and artificial intelligence. It can be used across a range of tasks, but is used mainly for Types of artificial neural networks#Training, training and Statistical infer ...
, also implements t-SNE
online version
* The Julia packag
TSne
implements t-SNE
References
External links
* {{Cite journal , last1=Wattenberg , first1=Martin , last2=Viégas , first2=Fernanda , last3=Johnson , first3=Ian , date=2016-10-13 , title=How to Use t-SNE Effectively , url=https://distill.pub/2016/misread-tsne/ , journal=Distill , language=en , volume=1 , issue=10 , pages=e2 , doi=10.23915/distill.00002 , issn=2476-0757, doi-access=free . Interactive demonstration and tutorial.
Visualizing Data Using t-SNE
Google Tech Talk about t-SNE
Implementations of t-SNE in various languages
A link collection maintained by Laurens van der Maaten
Machine learning algorithms
Dimension reduction