T-SNE visualisation of word embeddings generated using 19th century literature

t-distributed stochastic neighbor embedding (t-SNE) is a

statistical Statistics (from German language, German: ', "description of a State (polity), state, a country") is the discipline that concerns the collection, organization, analysis, interpretation, and presentation of data. In applying statistics to a s ...

method for visualizing high-dimensional data by giving each datapoint a location in a two or three-dimensional map. It is based on Stochastic Neighbor Embedding originally developed by Geoffrey Hinton and Sam Roweis, where Laurens van der Maaten and Hinton proposed the ''t''-distributed variant. It is a

nonlinear dimensionality reduction Nonlinear dimensionality reduction, also known as manifold learning, is any of various related techniques that aim to project high-dimensional data, potentially existing across non-linear manifolds which cannot be adequately captured by linear de ...

technique for embedding high-dimensional data for visualization in a low-dimensional space of two or three dimensions. Specifically, it models each high-dimensional object by a two- or three-dimensional point in such a way that similar objects are modeled by nearby points and dissimilar objects are modeled by distant points with high probability. The t-SNE algorithm comprises two main stages. First, t-SNE constructs a

probability distribution In probability theory and statistics, a probability distribution is a Function (mathematics), function that gives the probabilities of occurrence of possible events for an Experiment (probability theory), experiment. It is a mathematical descri ...

over pairs of high-dimensional objects in such a way that similar objects are assigned a higher probability while dissimilar points are assigned a lower probability. Second, t-SNE defines a similar probability distribution over the points in the low-dimensional map, and it minimizes the

Kullback–Leibler divergence In mathematical statistics, the Kullback–Leibler (KL) divergence (also called relative entropy and I-divergence), denoted D_\text(P \parallel Q), is a type of statistical distance: a measure of how much a model probability distribution is diff ...

(KL divergence) between the two distributions with respect to the locations of the points in the map. While the original algorithm uses the

Euclidean distance In mathematics, the Euclidean distance between two points in Euclidean space is the length of the line segment between them. It can be calculated from the Cartesian coordinates of the points using the Pythagorean theorem, and therefore is o ...

between objects as the base of its similarity metric, this can be changed as appropriate. A Riemannian variant is UMAP. t-SNE has been used for visualization in a wide range of applications, including

genomics Genomics is an interdisciplinary field of molecular biology focusing on the structure, function, evolution, mapping, and editing of genomes. A genome is an organism's complete set of DNA, including all of its genes as well as its hierarchical, ...

computer security Computer security (also cybersecurity, digital security, or information technology (IT) security) is a subdiscipline within the field of information security. It consists of the protection of computer software, systems and computer network, n ...

research,

natural language processing Natural language processing (NLP) is a subfield of computer science and especially artificial intelligence. It is primarily concerned with providing computers with the ability to process data encoded in natural language and is thus closely related ...

, music analysis,

cancer research Cancer research is research into cancer to identify causes and develop strategies for prevention, diagnosis, treatment, and cure. Cancer research ranges from epidemiology, molecular bioscience to the performance of clinical trials to evaluate ...

bioinformatics Bioinformatics () is an interdisciplinary field of science that develops methods and Bioinformatics software, software tools for understanding biological data, especially when the data sets are large and complex. Bioinformatics uses biology, ...

, geological domain interpretation, and biomedical signal processing. For a data set with ''n'' elements, t-SNE runs in time and requires space.

Details

Given a set of

N

high-dimensional objects

\mathbf_1, \dots, \mathbf_N

, t-SNE first computes probabilities

p_

that are proportional to the similarity of objects

\mathbf_i

and

\mathbf_j

, as follows. For

i \neq j

, define :

p_ = \frac

and set

p_ = 0

. Note the above denominator ensures

\sum_j p_ = 1

for all

i

. As van der Maaten and Hinton explained: "The similarity of datapoint

x_j

to datapoint

x_i

is the conditional probability,

p_

, that

x_i

would pick

x_j

as its neighbor if neighbors were picked in proportion to their probability density under a Gaussian centered at

x_i

." Now define :

p_ = \frac

This is motivated because

p_

and

p_

from the N samples are estimated as 1/N, so the conditional probability can be written as

p_ = Np_

and

p_ = Np_

. Since

p_ =  p_

, you can obtain previous formula. Also note that

p_ = 0

and

\sum_ p_ = 1

. The bandwidth of the

Gaussian kernel In mathematics, a Gaussian function, often simply referred to as a Gaussian, is a function of the base form f(x) = \exp (-x^2) and with parametric extension f(x) = a \exp\left( -\frac \right) for arbitrary real constants , and non-zero . It is ...

\sigma_i

is set in such a way that the

entropy Entropy is a scientific concept, most commonly associated with states of disorder, randomness, or uncertainty. The term and the concept are used in diverse fields, from classical thermodynamics, where it was first recognized, to the micros ...

of the conditional distribution equals a predefined entropy using the

bisection method In mathematics, the bisection method is a root-finding method that applies to any continuous function for which one knows two values with opposite signs. The method consists of repeatedly bisecting the interval defined by these values and t ...

. As a result, the bandwidth is adapted to the

density Density (volumetric mass density or specific mass) is the ratio of a substance's mass to its volume. The symbol most often used for density is ''ρ'' (the lower case Greek letter rho), although the Latin letter ''D'' (or ''d'') can also be u ...

of the data: smaller values of

\sigma_i

are used in denser parts of the data space. The entropy increases with the perplexity of this distribution

P_i

; this relation is seen as :

Perp(P_i) = 2^

where

H(P_i)

is the Shannon entropy

H(P_i)=-\sum_jp_\log_2p_.

The perplexity is a hand-chosen parameter of t-SNE, and as the authors state, "perplexity can be interpreted as a smooth measure of the effective number of neighbors. The performance of SNE is fairly robust to changes in the perplexity, and typical values are between 5 and 50.". Since the Gaussian kernel uses the

\lVert x_i-x_j \rVert

, it is affected by the

curse of dimensionality The curse of dimensionality refers to various phenomena that arise when analyzing and organizing data in high-dimensional spaces that do not occur in low-dimensional settings such as the three-dimensional physical space of everyday experience. T ...

, and in high dimensional data when distances lose the ability to discriminate, the

p_

become too similar (asymptotically, they would converge to a constant). It has been proposed to adjust the distances with a power transform, based on the intrinsic dimension of each point, to alleviate this. t-SNE aims to learn a

d

-dimensional map

\mathbf_1, \dots, \mathbf_N

(with

\mathbf_i \in \mathbb^d

and

d

typically chosen as 2 or 3) that reflects the similarities

p_

as well as possible. To this end, it measures similarities

q_

between two points in the map

\mathbf_i

and

\mathbf_j

, using a very similar approach. Specifically, for

i \neq j

, define

q_

as :

q_ = \frac

and set

q_ = 0

. Herein a heavy-tailed Student t-distribution (with one-degree of freedom, which is the same as a

Cauchy distribution The Cauchy distribution, named after Augustin-Louis Cauchy, is a continuous probability distribution. It is also known, especially among physicists, as the Lorentz distribution (after Hendrik Lorentz), Cauchy–Lorentz distribution, Lorentz(ian) ...

) is used to measure similarities between low-dimensional points in order to allow dissimilar objects to be modeled far apart in the map. The locations of the points

\mathbf_i

in the map are determined by minimizing the (non-symmetric)

of the distribution

P

from the distribution

Q

, that is: :

\mathrm\left(P \parallel Q\right) = \sum_ p_ \log \frac

The minimization of the Kullback–Leibler divergence with respect to the points

\mathbf_i

is performed using

gradient descent Gradient descent is a method for unconstrained mathematical optimization. It is a first-order iterative algorithm for minimizing a differentiable multivariate function. The idea is to take repeated steps in the opposite direction of the gradi ...

. The result of this optimization is a map that reflects the similarities between the high-dimensional inputs.

Output

While t-SNE plots often seem to display clusters, the visual clusters can be strongly influenced by the chosen parameterization (especially the perplexity) and so a good understanding of the parameters for t-SNE is needed. Such "clusters" can be shown to even appear in structured data with no clear clustering, and so may be false findings. Similarly, the size of clusters produced by t-SNE is not informative, and neither is the distance between clusters. Thus, interactive exploration may be needed to choose parameters and validate results. It has been shown that t-SNE can often recover well-separated clusters, and with special parameter choices, approximates a simple form of

spectral clustering In multivariate statistics, spectral clustering techniques make use of the spectrum (eigenvalues) of the similarity matrix of the data to perform dimensionality reduction before clustering in fewer dimensions. The similarity matrix is provided ...

Software

* A C++ implementation of Barnes-Hut is available on th
github account
of one of the original authors. * The R packag
Rtsne
implements t-SNE in R. * ELKI contains tSNE, also with Barnes-Hut approximation *

scikit-learn scikit-learn (formerly scikits.learn and also known as sklearn) is a free and open-source machine learning library for the Python programming language. It features various classification, regression and clustering algorithms including support ...

, a popular machine learning library in Python implements t-SNE with both exact solutions and the Barnes-Hut approximation. * Tensorboard, the visualization kit associated with

TensorFlow TensorFlow is a Library (computing), software library for machine learning and artificial intelligence. It can be used across a range of tasks, but is used mainly for Types of artificial neural networks#Training, training and Statistical infer ...

, also implements t-SNE
online version
* The Julia packag
TSne
implements t-SNE

References

External links

* {{Cite journal , last1=Wattenberg , first1=Martin , last2=Viégas , first2=Fernanda , last3=Johnson , first3=Ian , date=2016-10-13 , title=How to Use t-SNE Effectively , url=https://distill.pub/2016/misread-tsne/ , journal=Distill , language=en , volume=1 , issue=10 , pages=e2 , doi=10.23915/distill.00002 , issn=2476-0757, doi-access=free . Interactive demonstration and tutorial.
Visualizing Data Using t-SNE
Google Tech Talk about t-SNE
Implementations of t-SNE in various languages
A link collection maintained by Laurens van der Maaten Machine learning algorithms Dimension reduction