In mathematics, the Johnson–Lindenstrauss lemma is a result named after William B. Johnson and Joram Lindenstrauss concerning low-distortion embeddings of points from high-dimensional into low-dimensional Euclidean space. The lemma states that a set of points in a high-dimensional space can be embedded into a space of much lower dimension in such a way that distances between the points are nearly preserved. The map used for the embedding is at least Lipschitz, and can even be taken to be an orthogonal projection.
The lemma has applications in compressed sensing, manifold learning, dimensionality reduction, and graph embedding. Much of the data stored and manipulated on computers, including text and images, can be represented as points in a high-dimensional space (see vector space model for the case of text). However, the essential algorithms for working with such data tend to become bogged down very quickly as dimension increases. It is therefore desirable to reduce the dimensionality of the data in a way that preserves its relevant structure. The Johnson–Lindenstrauss lemma is a classic result in this vein.
Also, the lemma is tight up to a constant factor, i.e. there exists a set of points of size ''m'' that needs dimension
:<math>\Omega\left(\frac{\log m}{\varepsilon^2}\right)</math>
in order to preserve the distances between all pairs of points within a factor of <math>(1 \pm \varepsilon)</math>.
Lemma
Given <math>0 < \varepsilon < 1</math>, a set <math>X</math> of <math>m</math> points in <math>\mathbb{R}^N</math>, and an integer <math>n > 8 (\ln m)/\varepsilon^2</math>, there is a linear map <math>f : \mathbb{R}^N \rightarrow \mathbb{R}^n</math> such that
:<math>(1 - \varepsilon)\,\|u - v\|^2 \;\le\; \|f(u) - f(v)\|^2 \;\le\; (1 + \varepsilon)\,\|u - v\|^2</math>
for all <math>u, v \in X</math>.
The formula can be rearranged:
:<math>(1+\varepsilon)^{-1}\,\|f(u)-f(v)\|^2 \;\le\; \|u-v\|^2 \;\le\; (1-\varepsilon)^{-1}\,\|f(u)-f(v)\|^2.</math>
Alternatively, for any <math>\varepsilon \in (0, 1)</math> and any integer <math>n \ge 15 (\ln m)/\varepsilon^2</math>[Or any integer <math>n > 128 (\ln m)/(9\varepsilon^2)</math>.] there exists a linear function <math>f : \mathbb{R}^N \rightarrow \mathbb{R}^n</math> such that the restriction <math>f|_X</math> is <math>(1+\varepsilon)</math>-bi-Lipschitz.
[This result follows from the above result. Sketch of proof: Note <math>\sqrt{1 + \tfrac{3}{4}\varepsilon} \le 1 + \varepsilon</math> and <math>\sqrt{1 - \tfrac{3}{4}\varepsilon} \ge 1/(1+\varepsilon)</math> for all <math>\varepsilon \in (0,1)</math>. Do casework for 1=''m'' and 1<''m'', applying the above result to <math>\tfrac{3}{4}\varepsilon</math> in the latter case, noting <math>8/(\tfrac{3}{4}\varepsilon)^2 = 128/(9\varepsilon^2)</math>.]
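The guarantee is easy to probe numerically. The following sketch (illustrative only; it uses a scaled Gaussian matrix, one standard way to realize such a random linear map, and hypothetical parameter values) draws ''m'' random points and measures the worst relative distortion of squared pairwise distances after mapping to the lower dimension suggested by the lemma:
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

eps = 0.2
m, N = 50, 10_000                          # m points in a high-dimensional space R^N
n = int(8 * np.log(m) / eps**2) + 1        # target dimension suggested by the lemma

X = rng.normal(size=(m, N))                # an arbitrary point set
A = rng.normal(size=(n, N)) / np.sqrt(n)   # random linear map f(x) = Ax
Y = X @ A.T                                # images f(x) of all points

# Worst relative distortion of squared pairwise distances;
# with high probability it stays below eps.
worst = 0.0
for i in range(m):
    for j in range(i + 1, m):
        d_orig = np.sum((X[i] - X[j]) ** 2)
        d_proj = np.sum((Y[i] - Y[j]) ** 2)
        worst = max(worst, abs(d_proj / d_orig - 1))
print(f"n = {n}, worst distortion = {worst:.3f} (eps = {eps})")
</syntaxhighlight>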
One proof of the lemma takes ''ƒ'' to be a suitable multiple of the orthogonal projection onto a random subspace of dimension <math>n</math> in <math>\mathbb{R}^N</math>, and exploits the phenomenon of concentration of measure.
An orthogonal projection will, in general, reduce the average distance between points, but the lemma can be viewed as dealing with ''relative distances'', which do not change under scaling. In a nutshell, you roll the dice and obtain a random projection, which will reduce the average distance, and then you scale up the distances so that the average distance returns to its previous value. If you keep rolling the dice, you will, in randomized polynomial time, find a projection for which the (scaled) distances satisfy the lemma.
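A minimal sketch of this construction (illustrative only; the factor <math>\sqrt{N/n}</math> is the rescaling that restores squared lengths in expectation, and the parameter values are hypothetical):
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(1)
N, n = 2_000, 200

# Orthonormal basis of a random n-dimensional subspace of R^N,
# obtained from the QR factorization of a Gaussian matrix.
Q, _ = np.linalg.qr(rng.normal(size=(N, n)))

def f(x):
    # Orthogonal projection onto the subspace (expressed in the subspace basis),
    # rescaled so that squared lengths are preserved in expectation.
    return np.sqrt(N / n) * (Q.T @ x)

x = rng.normal(size=N)
print(np.dot(x, x), np.dot(f(x), f(x)))    # close, by concentration of measure
</syntaxhighlight>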
Alternate statement
A related lemma is the distributional JL lemma. This lemma states that for any <math>0 < \varepsilon, \delta < 1/2</math> and positive integer <math>d</math>, there exists a distribution over <math>\mathbb{R}^{k \times d}</math> from which the matrix <math>A</math> is drawn such that for <math>k = O(\varepsilon^{-2}\log(1/\delta))</math> and for any unit-length vector <math>x \in \mathbb{R}^d</math>, the claim below holds.
:<math>P\left(\left|\,\|Ax\|_2^2 - 1\,\right| > \varepsilon\right) < \delta</math>
One can obtain the JL lemma from the distributional version by setting <math>x = (u - v)/\|u - v\|_2</math> and <math>\delta < 1/m^2</math> for some pair ''u'', ''v'' both in ''X''. Then the JL lemma follows by a union bound over all such pairs.
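A minimal sketch of the distributional statement (assuming, as one standard choice of distribution, a matrix with i.i.d. Gaussian entries of variance <math>1/k</math>, and a hypothetical constant in <math>k = O(\varepsilon^{-2}\log(1/\delta))</math>):
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(2)
eps, delta = 0.5, 0.01
d = 500
k = int(8 * np.log(1 / delta) / eps**2)     # k = O(eps^-2 log(1/delta)); constant chosen for illustration

x = np.zeros(d)
x[0] = 1.0                                   # any fixed unit-length vector

# Estimate Pr(| ||Ax||^2 - 1 | > eps) over independent draws of A.
trials, failures = 300, 0
for _ in range(trials):
    A = rng.normal(size=(k, d)) / np.sqrt(k)   # i.i.d. N(0, 1/k) entries
    if abs(np.linalg.norm(A @ x) ** 2 - 1) > eps:
        failures += 1
print(f"k = {k}, empirical failure rate = {failures / trials:.3f} (target < {delta})")
</syntaxhighlight>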
Speeding up the JL transform
Given ''A'', computing the matrix vector product takes ''O''(''kd'') time. There has been some work in deriving distributions for which the matrix vector product can be computed in less than ''O''(''kd'') time.
There are two major lines of work. The first, the ''Fast Johnson–Lindenstrauss Transform'' (FJLT), was introduced by Ailon and Chazelle in 2006.
This method allows the computation of the matrix–vector product in just <math>O(d \log d + k^{2+\gamma})</math> time for any constant <math>\gamma > 0</math>.
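The following is a minimal sketch in the spirit of this idea (a simplified subsampled randomized Hadamard transform: random sign flips, a fast Walsh–Hadamard transform costing <math>O(d\log d)</math>, then coordinate subsampling with rescaling; the actual Ailon–Chazelle construction uses a sparse random projection for the final step, and all parameter values here are hypothetical):
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(3)

def fwht(x):
    """Fast Walsh-Hadamard transform in O(d log d); len(x) must be a power of two."""
    x = x.copy()
    h = 1
    while h < len(x):
        for i in range(0, len(x), 2 * h):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b
            x[i + h:i + 2 * h] = a - b
        h *= 2
    return x

d, k = 1024, 256                             # d must be a power of two here
D = rng.choice([-1.0, 1.0], size=d)          # random sign flips
S = rng.choice(d, size=k, replace=False)     # k sampled coordinates

def fast_jl_like(x):
    # Sign flips, fast orthogonal mixing, then subsampling with rescaling:
    # O(d log d + k) per vector instead of O(kd).
    y = fwht(D * x) / np.sqrt(d)
    return np.sqrt(d / k) * y[S]

x = rng.normal(size=d)
print(np.linalg.norm(x) ** 2, np.linalg.norm(fast_jl_like(x)) ** 2)   # approximately equal
</syntaxhighlight>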
Another approach is to build a distribution supported over matrices that are sparse.
This method allows keeping only an <math>\varepsilon</math> fraction of the entries in the matrix, which means the computation can be done in just <math>kd\varepsilon</math> time.
Furthermore, if the vector has only <math>b</math> non-zero entries, the Sparse JL takes time <math>kb\varepsilon</math>, which may be much less than the <math>d\log d</math> time used by Fast JL.
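A minimal sketch of a sparse embedding matrix in this spirit (a simplified construction in which each column has <math>s</math> nonzero entries of value <math>\pm 1/\sqrt{s}</math> in random rows, with <math>s</math> playing the role of the <math>\varepsilon k</math> entries kept per column; the helper name and parameter values are hypothetical):
<syntaxhighlight lang="python">
import numpy as np
from scipy import sparse

rng = np.random.default_rng(4)

def sparse_jl(k, d, s, rng):
    """k x d matrix with s nonzeros per column, each equal to +-1/sqrt(s)."""
    rows = np.concatenate([rng.choice(k, size=s, replace=False) for _ in range(d)])
    cols = np.repeat(np.arange(d), s)
    vals = rng.choice([-1.0, 1.0], size=s * d) / np.sqrt(s)
    return sparse.csc_matrix((vals, (rows, cols)), shape=(k, d))

d, k, s = 5_000, 256, 16
A = sparse_jl(k, d, s, rng)

x = rng.normal(size=d)
print(np.linalg.norm(x) ** 2, np.linalg.norm(A @ x) ** 2)   # approximately equal

# If x has only b nonzero entries, A @ x touches only the s*b nonzeros in the
# corresponding columns, so the product costs O(s*b) rather than O(k*d).
</syntaxhighlight>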
Tensorized random projections
It is possible to combine two JL matrices by taking the so-called
face-splitting product, which is defined as the tensor products of the rows (proposed by V. Slyusar[Anna Esteve, Eva Boj & Josep Fortiana (2009): Interaction Terms in Distance-Based Regression, Communications in Statistics – Theory and Methods, 38:19, p. 350.] in 1996 for radar and digital antenna array applications).
More directly, let <math>C \in \mathbb{R}^{k \times n_1}</math> and <math>D \in \mathbb{R}^{k \times n_2}</math> be two matrices. Then the face-splitting product is
:<math>C \bullet D = \begin{bmatrix} C_1 \otimes D_1 \\ C_2 \otimes D_2 \\ \vdots \\ C_k \otimes D_k \end{bmatrix},</math>
where <math>C_i</math> and <math>D_i</math> denote the ''i''-th rows of <math>C</math> and <math>D</math>.
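In code, the face-splitting product is simply a row-wise Kronecker product (a minimal NumPy sketch; the function name and the small example matrices are hypothetical):
<syntaxhighlight lang="python">
import numpy as np

def face_splitting(C, D):
    """Row-wise Kronecker (face-splitting) product: row i is kron(C[i], D[i])."""
    assert C.shape[0] == D.shape[0]
    return np.einsum('ij,ik->ijk', C, D).reshape(C.shape[0], -1)

C = np.arange(6.0).reshape(2, 3)
D = np.arange(8.0).reshape(2, 4)
print(face_splitting(C, D).shape)   # (2, 12): each row is the tensor product of the corresponding rows
</syntaxhighlight>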
This idea of tensorization was used by Kasiviswanathan et al. 2010[Kasiviswanathan, Shiva Prasad, et al. "The price of privately releasing contingency tables and the spectra of random matrices with correlated rows." Proceedings of the forty-second ACM symposium on Theory of computing. 2010.] for differential privacy.
JL matrices defined like this use fewer random bits, and can be applied quickly to vectors that have tensor structure, due to the following identity:
:<math>(C \bullet D)(x \otimes y) = (Cx) \circ (Dy),</math>
where <math>\circ</math> is the element-wise (Hadamard) product.
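The identity is easy to check numerically (an illustrative sketch with hypothetical small dimensions):
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(5)
C, D = rng.normal(size=(7, 3)), rng.normal(size=(7, 4))
x, y = rng.normal(size=3), rng.normal(size=4)

# Face-splitting product: row i is kron(C[i], D[i]).
CD = np.einsum('ij,ik->ijk', C, D).reshape(C.shape[0], -1)

lhs = CD @ np.kron(x, y)          # apply the combined matrix to the tensor-structured vector
rhs = (C @ x) * (D @ y)           # Hadamard product of the two small matrix-vector products
print(np.allclose(lhs, rhs))      # True
</syntaxhighlight>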
Such computations have been used to efficiently compute polynomial kernels and many other linear-algebra algorithms.[Woodruff, David P. "Sketching as a Tool for Numerical Linear Algebra." Theoretical Computer Science 10.1–2 (2014): 1–157.]
In 2020 it was shown that if the matrices <math>C_1, C_2, \dots, C_c</math> are independent <math>\pm 1</math> or Gaussian matrices, the combined matrix <math>C_1 \bullet C_2 \bullet \dots \bullet C_c</math> satisfies the distributional JL lemma if the number of rows is at least
:<math>O\left(\varepsilon^{-2}\log(1/\delta) + \varepsilon^{-1}\left(\tfrac{1}{c}\log(1/\delta)\right)^c\right).</math>
For large <math>\varepsilon</math> this is as good as the completely random Johnson–Lindenstrauss, but
a matching lower bound in the same paper shows that this exponential dependency on <math>(\log(1/\delta))^c</math> is necessary.
Alternative JL constructions are suggested to circumvent this.
See also
* Random projection
* Restricted isometry property
Notes
References
{{DEFAULTSORT:Johnson-Lindenstrauss lemma}}
[[Category:Lemmas]]
[[Category:Metric geometry]]