Curse of dimensionality

The curse of dimensionality refers to various phenomena that arise when analyzing and organizing data in high-dimensional spaces that do not occur in low-dimensional settings such as the three-dimensional physical space of everyday experience. The expression was coined by Richard E. Bellman when considering problems in dynamic programming. Dimensionally cursed phenomena occur in domains such as numerical analysis, sampling, combinatorics, machine learning, data mining and databases. The common theme of these problems is that when the dimensionality increases, the volume of the space increases so fast that the available data become sparse. In order to obtain a reliable result, the amount of data needed often grows exponentially with the dimensionality. Also, organizing and searching data often relies on detecting areas where objects form groups with similar properties; in high-dimensional data, however, all objects appear to be sparse and dissimilar in many ways, which prevents common data organization strategies from being efficient.


Domains


Combinatorics

In some problems, each variable can take one of several discrete values, or the range of possible values is divided to give a finite number of possibilities. Taking the variables together, a huge number of combinations of values must be considered. This effect is also known as the combinatorial explosion. Even in the simplest case of d binary variables, the number of possible combinations already is 2^d, exponential in the dimensionality. Naively, each additional dimension doubles the effort needed to try all combinations.
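
As an elementary illustration, the following Python sketch (the function name and the chosen values of d are arbitrary examples, not from any source) enumerates every assignment of d binary variables:

<syntaxhighlight lang="python">
from itertools import product

def count_binary_combinations(d: int) -> int:
    """Enumerate every assignment of d binary variables and count them."""
    return sum(1 for _ in product([0, 1], repeat=d))

for d in (5, 10, 20):
    print(d, count_binary_combinations(d))
# 5 -> 32, 10 -> 1024, 20 -> 1048576: each extra variable doubles the work.
</syntaxhighlight>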


Sampling

There is an exponential increase in volume associated with adding extra dimensions to a mathematical space. For example, 10^2 = 100 evenly spaced sample points suffice to sample a unit interval (a "1-dimensional cube") with no more than 10^{-2} = 0.01 distance between points; an equivalent sampling of a 10-dimensional unit hypercube with a lattice that has a spacing of 10^{-2} = 0.01 between adjacent points would require 10^{20} = (10^2)^{10} sample points. In general, with a spacing distance of 10^{-''n''} the 10-dimensional hypercube appears to be a factor of 10^{n(10-1)} = (10^n)^{10}/(10^n) "larger" than the 1-dimensional hypercube, which is the unit interval. In the above example ''n'' = 2: when using a sampling distance of 0.01 the 10-dimensional hypercube appears to be 10^{18} "larger" than the unit interval. This effect is a combination of the combinatorics problems above and the distance function problems explained below.


Optimization

When solving dynamic optimization problems by numerical backward induction, the objective function must be computed for each combination of values. This is a significant obstacle when the dimension of the "state variable" is large.
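
As a rough sketch of the obstacle, assuming the state variable is discretized on a uniform tensor-product grid (the grid sizes below are arbitrary examples), the number of points at which the objective must be evaluated in a single backward-induction step grows as follows:

<syntaxhighlight lang="python">
def state_grid_size(points_per_dim: int, state_dim: int) -> int:
    """Grid nodes at which the objective/value function must be evaluated in one
    backward-induction step, assuming a uniform tensor-product discretization."""
    return points_per_dim ** state_dim

for k in (1, 2, 4, 8):
    print(k, state_grid_size(20, k))
# 1 -> 20, 2 -> 400, 4 -> 160_000, 8 -> 25_600_000_000
</syntaxhighlight>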


Machine learning

In machine learning problems that involve learning a "state-of-nature" from a finite number of data samples in a high-dimensional feature space with each feature having a range of possible values, typically an enormous amount of training data is required to ensure that there are several samples with each combination of values. In an abstract sense, as the number of features or dimensions grows, the amount of data we need to generalize accurately grows exponentially. A typical rule of thumb is that there should be at least 5 training examples for each dimension in the representation. In machine learning, and insofar as predictive performance is concerned, the curse of dimensionality is used interchangeably with the ''peaking phenomenon'', which is also known as the ''Hughes phenomenon''. This phenomenon states that with a fixed number of training samples, the average (expected) predictive power of a classifier or regressor first increases as the number of dimensions or features used is increased, but beyond a certain dimensionality it starts deteriorating instead of improving steadily. Nevertheless, in the context of a ''simple'' classifier (linear discriminant analysis in the multivariate Gaussian model under the assumption of a common known covariance matrix), Zollanvari et al. showed both analytically and empirically that as long as the relative cumulative efficacy of an additional feature set (with respect to features that are already part of the classifier) is greater (or less) than the size of this additional feature set, the expected error of the classifier constructed using these additional features will be less (or greater) than the expected error of the classifier constructed without them. In other words, both the size of additional features and their (relative) cumulative discriminatory effect are important in observing a decrease or increase in the average predictive power.
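
A hedged simulation sketch of the peaking phenomenon, using synthetic Gaussian data and scikit-learn's LinearDiscriminantAnalysis (the sample sizes, feature counts and class-mean shift are arbitrary choices, and exact numbers vary with the random seed):

<syntaxhighlight lang="python">
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

def sample(n, max_dim=200, n_informative=5):
    """Two classes; only the first n_informative features carry any class signal."""
    y = rng.integers(0, 2, n)
    X = rng.normal(size=(n, max_dim))
    X[:, :n_informative] += y[:, None]      # shift informative features by the class label
    return X, y

X_train, y_train = sample(40)               # fixed, small training set
X_test, y_test = sample(5000)

for d in (2, 5, 10, 20, 50, 100, 200):
    clf = LinearDiscriminantAnalysis().fit(X_train[:, :d], y_train)
    print(d, round(clf.score(X_test[:, :d], y_test), 3))
# Accuracy typically rises while informative features are added, then falls as
# more and more noise-only features are included (the peaking/Hughes effect).
</syntaxhighlight>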


Data mining

In data mining, the curse of dimensionality refers to a data set with too many features. Consider the first table, which depicts 200 individuals and 2000 genes (features), with a 1 or 0 denoting whether or not they have a genetic mutation in that gene. A data mining application to this data set may be finding the correlation between specific genetic mutations and creating a classification algorithm such as a decision tree to determine whether an individual has cancer or not.

A common practice of data mining in this domain would be to create association rules between genetic mutations that lead to the development of cancers. To do this, one would have to loop through each genetic mutation of each individual and find other genetic mutations that occur over a desired threshold and create pairs. They would start with pairs of two, then three, then four, until they result in an empty set of pairs. The complexity of this algorithm can lead to calculating all permutations of gene pairs for each individual or row. Given that the formula for the number of permutations of n items taken r at a time is \frac{n!}{(n-r)!}, the number of 3-permutations of the 2000 genes is 2000!/(2000-3)! = 2000 × 1999 × 1998 = 7,988,004,000 ordered triples of genes to evaluate for each individual. The number of such combinations grows factorially as the group size increases, as depicted in the permutation table. As that table shows, one of the major problems data miners face regarding the curse of dimensionality is that the space of possible parameter values grows exponentially or factorially as the number of features in the data set grows. This problem critically affects both computational time and space when searching for associations or optimal features to consider.

Another problem data miners may face when dealing with too many features is the notion that the number of false predictions or classifications tends to increase as the number of features grows in the data set. In terms of the classification problem discussed above, keeping every data point could lead to a higher number of false positives and false negatives in the model. This may seem counterintuitive, but consider the genetic mutation table from above, depicting all genetic mutations for each individual. Each genetic mutation, whether it correlates with cancer or not, will have some input or weight in the model that guides the decision-making process of the algorithm. There may be mutations that are outliers or ones that dominate the overall distribution of genetic mutations when in fact they do not correlate with cancer. These features may be working against one's model, making it more difficult to obtain optimal results.

This problem is up to the data miner to solve, and there is no universal solution. The first step any data miner should take is to explore the data, in an attempt to gain an understanding of how it can be used to solve the problem. One must first understand what the data means, and what they are trying to discover, before they can decide if anything must be removed from the data set. Then they can create or use a feature selection or dimensionality reduction algorithm to remove samples or features from the data set if they deem it necessary. One example of such methods is the interquartile range (IQR) method, which flags values of a feature that fall far outside its interquartile range as outliers and removes them from the data set.
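
A quick check of this growth, using Python's math.perm, which computes n!/(n − r)!:

<syntaxhighlight lang="python">
from math import perm  # perm(n, r) == n! / (n - r)!

n_genes = 2000
for r in (2, 3, 4):
    print(r, perm(n_genes, r))
# r = 2 ->          3,998,000
# r = 3 ->      7,988,004,000
# r = 4 -> 15,952,043,988,000
</syntaxhighlight>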


Distance function

When a measure such as a Euclidean distance is defined using many coordinates, there is little difference in the distances between different pairs of samples.

One way to illustrate the "vastness" of high-dimensional Euclidean space is to compare the proportion of an inscribed hypersphere with radius r and dimension d to that of a hypercube with edges of length 2r. The volume of such a sphere is \frac{2 r^d \pi^{d/2}}{d\,\Gamma(d/2)}, where \Gamma is the gamma function, while the volume of the cube is (2r)^d. As the dimension d of the space increases, the hypersphere becomes an insignificant volume relative to that of the hypercube. This can clearly be seen by comparing the proportions as the dimension d goes to infinity:

:\frac{V_\text{hypersphere}}{V_\text{hypercube}} = \frac{\pi^{d/2}}{d\, 2^{d-1} \Gamma(d/2)} \rightarrow 0 as d \rightarrow \infty.

Furthermore, the distance between the center and the corners is r\sqrt{d}, which increases without bound for fixed r. In this sense, when points are uniformly generated in a high-dimensional hypercube, almost all points are much farther than r units away from the centre. In high dimensions, the volume of the ''d''-dimensional unit hypercube (with coordinates of the vertices \pm 1) is concentrated near a sphere with the radius \sqrt{d}/\sqrt{3} for large dimension ''d''. Indeed, for each coordinate x_i the average value of x_i^2 in the cube is

:\langle x_i^2\rangle = \frac{1}{2}\int_{-1}^{1} x^2 \, dx = \frac{1}{3}.

The variance of x_i^2 for the uniform distribution in the cube is

:\frac{1}{2}\int_{-1}^{1} x^4 \, dx - \langle x_i^2\rangle^2 = \frac{1}{5} - \frac{1}{9} = \frac{4}{45}.

Therefore, the squared distance from the origin, r^2 = \sum_i x_i^2, has the average value ''d''/3 and variance 4''d''/45. For large ''d'', the distribution of r^2/d is close to the normal distribution with mean 1/3 and standard deviation \sqrt{4/(45 d)}, according to the central limit theorem. Thus, when uniformly generating points in high dimensions, both the "middle" of the hypercube and the corners are empty, and all the volume is concentrated near the surface of a sphere of "intermediate" radius \sqrt{d/3}.

This also helps to understand the chi-squared distribution. Indeed, the (non-central) chi-squared distribution associated to a random point in the interval [-1, 1] is the same as the distribution of the length-squared of a random point in the ''d''-cube. By the law of large numbers, this distribution concentrates itself in a narrow band around ''d'' times the standard deviation squared (\sigma^2) of the original derivation. This illuminates the chi-squared distribution and also illustrates that most of the volume of the ''d''-cube concentrates near the boundary of a sphere of radius \sigma\sqrt{d}.
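
An empirical sketch of this concentration for uniform samples from [-1, 1]^d (the sample sizes and dimensions are arbitrary choices):

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

for d in (2, 20, 200, 2000):
    x = rng.uniform(-1.0, 1.0, size=(10_000, d))   # uniform points in the cube [-1, 1]^d
    r2_over_d = (x ** 2).sum(axis=1) / d
    print(d, round(r2_over_d.mean(), 4), round(r2_over_d.std(), 4))
# The mean of r^2/d stays near 1/3 while its spread shrinks like 1/sqrt(d):
# the points concentrate near a sphere of radius sqrt(d/3).
</syntaxhighlight>
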
A further development of this phenomenon is as follows. Any fixed distribution on the real numbers induces a product distribution on points in \mathbb{R}^d. For any fixed ''n'', it turns out that the difference between the minimum and the maximum distance between a random reference point ''Q'' and a list of ''n'' random data points ''P''1, ..., ''P''''n'' becomes indiscernible compared to the minimum distance:

:\lim_{d \to \infty} E\left(\frac{\operatorname{dist}_{\max}(d) - \operatorname{dist}_{\min}(d)}{\operatorname{dist}_{\min}(d)}\right) \to 0.

This is often cited as distance functions losing their usefulness (for the nearest-neighbor criterion in feature-comparison algorithms, for example) in high dimensions. However, recent research has shown this to only hold in the artificial scenario when the one-dimensional distributions are independent and identically distributed. When attributes are correlated, data can become easier and provide higher distance contrast, and the signal-to-noise ratio was found to play an important role; thus, feature selection should be used.

More recently, it has been suggested that there may be a conceptual flaw in the argument that contrast-loss creates a curse in high dimensions. Machine learning can be understood as the problem of assigning instances to their respective generative process of origin, with class labels acting as symbolic representations of individual generative processes. The curse's derivation assumes all instances are independent, identical outcomes of a single high-dimensional generative process. If there is only one generative process, there would exist only one (naturally occurring) class and machine learning would be conceptually ill-defined in both high and low dimensions. Thus, the traditional argument that contrast-loss creates a curse may be fundamentally inappropriate. In addition, it has been shown that when the generative model is modified to accommodate multiple generative processes, contrast-loss can morph from a curse to a blessing, as it ensures that the nearest-neighbor of an instance is almost-surely its most closely related instance. From this perspective, contrast-loss makes high-dimensional distances especially meaningful and not especially non-meaningful, as is often argued.
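
The contrast-loss limit stated above, for the i.i.d. case, can be checked numerically with a rough sketch (uniform coordinates; the values of n and d are arbitrary):

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(1)

def relative_contrast(d: int, n: int = 1000) -> float:
    """(dist_max - dist_min) / dist_min between a random query Q and n i.i.d. points."""
    points = rng.uniform(size=(n, d))
    query = rng.uniform(size=d)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

for d in (2, 10, 100, 1000):
    print(d, round(relative_contrast(d), 3))
# The ratio shrinks toward 0 as d grows: nearest and farthest neighbors
# become nearly indistinguishable when the coordinates are i.i.d.
</syntaxhighlight>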


Nearest neighbor search

The effect complicates nearest neighbor search in high-dimensional space. It is not possible to quickly reject candidates by using the difference in one coordinate as a lower bound for a distance based on all the dimensions. However, it has recently been observed that the mere number of dimensions does not necessarily result in difficulties, since ''relevant'' additional dimensions can also increase the contrast. In addition, for the resulting ranking it remains useful to discern close and far neighbors. Irrelevant ("noise") dimensions, however, reduce the contrast in the manner described above. In time series analysis, where the data are inherently high-dimensional, distance functions also work reliably as long as the signal-to-noise ratio is high enough.
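
A rough sketch of why coordinate-wise pruning degrades, on synthetic uniform data. The rejection rule below (discarding a candidate whenever a single coordinate difference already exceeds the current best distance) is a simplified stand-in for the bounds used by tree-based indexes, not a description of any particular algorithm:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(2)

def prunable_fraction(d: int, n: int = 5000) -> float:
    """Fraction of candidates that a single-coordinate difference can rule out,
    given a current-best distance taken from one arbitrary first candidate."""
    points = rng.uniform(size=(n, d))
    query = rng.uniform(size=d)
    best = np.linalg.norm(points[0] - query)              # distance to the first candidate
    coord_gaps = np.abs(points[1:] - query).max(axis=1)   # best one-coordinate lower bound
    return float((coord_gaps > best).mean())

for d in (1, 2, 5, 10, 50):
    print(d, round(prunable_fraction(d), 3))
# In a few dimensions many candidates can be discarded cheaply; in high dimensions
# the one-coordinate bound almost never exceeds the best distance, so nothing is pruned.
</syntaxhighlight>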


''k''-nearest neighbor classification

Another effect of high dimensionality on distance functions concerns ''k''-nearest neighbor (''k''-NN) graphs constructed from a data set using a distance function. As the dimension increases, the indegree distribution of the ''k''-NN digraph becomes skewed with a peak on the right because of the emergence of a disproportionate number of hubs, that is, data points that appear in many more ''k''-NN lists of other data points than the average. This phenomenon can have a considerable impact on various techniques for classification (including the ''k''-NN classifier), semi-supervised learning, and clustering, and it also affects information retrieval.
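
A rough illustration of hub formation on i.i.d. Gaussian data, using scikit-learn's NearestNeighbors (the dimensions, sample size and k are arbitrary choices):

<syntaxhighlight lang="python">
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(3)
k, n = 10, 2000

for d in (3, 30, 300):
    X = rng.normal(size=(n, d))
    # k-NN lists (excluding each point itself), then count how often each point appears in them.
    nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nbrs.kneighbors(X)
    indegree = np.bincount(idx[:, 1:].ravel(), minlength=n)
    print(d, "max in-degree:", indegree.max(), " mean in-degree:", k)
# With growing dimension the in-degree distribution becomes increasingly skewed:
# a few "hub" points appear in far more k-NN lists than the average of k.
</syntaxhighlight>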


Anomaly detection

In a 2012 survey, Zimek et al. identified the following problems when searching for anomalies in high-dimensional data:

# Concentration of scores and distances: derived values such as distances become numerically similar
# Irrelevant attributes: in high-dimensional data, a significant number of attributes may be irrelevant
# Definition of reference sets: for local methods, reference sets are often nearest-neighbor based
# Incomparable scores for different dimensionalities: different subspaces produce incomparable scores
# Interpretability of scores: the scores often no longer convey a semantic meaning
# Exponential search space: the search space can no longer be systematically scanned
# Data snooping bias: given the large search space, for every desired significance a hypothesis can be found
# Hubness: certain objects occur more frequently in neighbor lists than others

Many of the analyzed specialized methods tackle one or another of these problems, but there remain many open research questions.


Blessing of dimensionality

Surprisingly and despite the expected "curse of dimensionality" difficulties, common-sense heuristics based on the most straightforward methods "can yield results which are almost surely optimal" for high-dimensional problems. The term "blessing of dimensionality" was introduced in the late 1990s. Donoho in his "Millennium manifesto" clearly explained why the "blessing of dimensionality" will form a basis of future data mining. The effects of the blessing of dimensionality were discovered in many applications and found their foundation in the concentration of measure phenomena. One example of the blessing of dimensionality phenomenon is linear separability of a random point from a large finite random set with high probability, even if this set is exponentially large: the number of elements in this random set can grow exponentially with dimension. Moreover, this linear functional can be selected in the form of the simplest linear Fisher discriminant. This separability theorem was proven for a wide class of probability distributions: general uniformly log-concave distributions, product distributions in a cube and many other families (reviewed recently). "The blessing of dimensionality and the curse of dimensionality are two sides of the same coin." For example, the typical property of essentially high-dimensional probability distributions in a high-dimensional space is: the squared distance of random points to a selected point is, with high probability, close to the average (or median) squared distance. This property significantly simplifies the expected geometry of data and the indexing of high-dimensional data (blessing), but, at the same time, it makes similarity search in high dimensions difficult and even useless (curse). Zimek et al. noted that while the typical formalizations of the curse of dimensionality affect i.i.d. data, having data that is separated in each attribute becomes easier even in high dimensions, and argued that the signal-to-noise ratio matters: data becomes easier with each attribute that adds signal, and harder with attributes that only add noise (irrelevant error) to the data. In particular, for unsupervised data analysis this effect is known as swamping.
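
A minimal sketch of the separability effect under assumptions not taken from the source: standard Gaussian data rather than the distribution families named above, and separation by the single linear functional z ↦ ⟨z, x⟩ thresholded at ⟨x, x⟩, in the spirit of a simple Fisher-type discriminant:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(4)

def separable_fraction(d: int, n: int = 10_000, trials: int = 50) -> float:
    """Fraction of trials in which a random point x is cut off from n other random
    points by the single hyperplane {z : <z, x> = <x, x>}."""
    successes = 0
    for _ in range(trials):
        x = rng.normal(size=d)
        Y = rng.normal(size=(n, d))
        if (Y @ x < x @ x).all():   # every other point lies on the far side of the hyperplane
            successes += 1
    return successes / trials

for d in (2, 10, 20, 50, 200):
    print(d, separable_fraction(d))
# In low dimension the random point is rarely separable from a set of 10,000 points;
# in high dimension separation by this one linear functional succeeds almost always.
</syntaxhighlight>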


See also

* Bellman equation
* Clustering high-dimensional data
* Concentration of measure
* Dimension reduction
* Dynamic programming
* Fourier-related transforms
* Linear least squares
* Model order reduction
* Multilinear PCA
* Multilinear subspace learning
* Principal component analysis
* Singular value decomposition

