Most real-world data sets consist of data vectors whose individual components are not statistically independent. In other words, knowing the value of one element provides information about the values of other elements in the data vector. When this occurs, it can be desirable to create a factorial code of the data, i.e., a new vector-valued representation of each data vector such that it is uniquely encoded by the resulting code vector (loss-free coding), while the code components are statistically independent.
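As a toy illustration of the definition (a minimal sketch of our own, not taken from the cited papers): suppose the second component of a binary data vector always duplicates the first. The XOR recode below is loss-free, yet its components are statistically independent, so it is a factorial code of the data.

import numpy as np

rng = np.random.default_rng(0)
x1 = rng.integers(0, 2, 10000)       # a random binary component
x2 = x1.copy()                       # a perfectly redundant duplicate

# factorial recode: y1 = x1, y2 = x1 XOR x2 (invertible: x1 = y1, x2 = y1 XOR y2)
y1, y2 = x1, x1 ^ x2                 # here y2 is identically 0

# independence check: the joint distribution factorizes into the marginals
for a in (0, 1):
    for b in (0, 1):
        joint = np.mean((y1 == a) & (y2 == b))
        product = np.mean(y1 == a) * np.mean(y2 == b)
        assert abs(joint - product) < 1e-12
print("(y1, y2) is a loss-free code with statistically independent components")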
Subsequent supervised learning usually works much better when the raw input data is first translated into such a factorial code. For example, suppose the final goal is to classify images with highly redundant pixels. A naive Bayes classifier will assume the pixels are statistically independent random variables and therefore fail to produce good results. If the data are first encoded in a factorial way, however, then the naive Bayes classifier will achieve its optimal performance (compare Schmidhuber et al. 1996).
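The duplicated-pixel case makes this failure mode concrete (a hedged sketch with illustrative helper functions, not code from the cited work): naive Bayes multiplies per-pixel likelihoods, so a duplicated pixel contributes its log-likelihood ratio twice and the posteriors become overconfident, whereas after a factorial recode the same evidence is counted once.

import numpy as np

rng = np.random.default_rng(1)
n = 50000
c = rng.integers(0, 2, n)                      # class label
x1 = np.where(rng.random(n) < 0.8, c, 1 - c)   # noisy pixel: agrees with c 80% of the time
x2 = x1.copy()                                 # perfectly redundant second pixel

def rate(f, mask):
    return (f[mask].sum() + 1) / (mask.sum() + 2)   # Laplace-smoothed P(f = 1 | class)

def nb_log_odds(features):
    # naive Bayes log-odds of class 1 versus class 0, assuming independent features
    total = np.zeros(n)
    for f in features:
        p1, p0 = rate(f, c == 1), rate(f, c == 0)
        total += np.where(f == 1, np.log(p1 / p0), np.log((1 - p1) / (1 - p0)))
    return total

raw = nb_log_odds([x1, x2])          # the duplicated evidence is counted twice
fac = nb_log_odds([x1, x1 ^ x2])     # factorial recode: the second component is constant
print(raw[0] / fac[0])               # approximately 2: the raw posteriors are overconfident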
To create factorial codes, Horace Barlow and co-workers suggested minimizing the sum of the bit entropies of the code components of binary codes (1989).
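In symbols (our notation, not necessarily Barlow's): for a binary code y = (y_1, \ldots, y_m) of the data x, the proposal is to

\min \sum_{i=1}^{m} H(y_i) \quad \text{subject to the coding } x \mapsto y \text{ being loss-free.}

Since \sum_{i} H(y_i) \ge H(y_1, \ldots, y_m) = H(x) for any loss-free code, the minimum is attained exactly when the bound holds with equality, i.e., when the code components y_i are statistically independent.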
Jürgen Schmidhuber (1992) re-formulated the problem in terms of predictors and binary feature detectors, each receiving the raw data as input. For each detector there is a predictor that sees the other detectors and learns to predict the output of its own detector in response to the various input vectors or images. But each detector uses a machine learning algorithm to become as unpredictable as possible. The global optimum of this objective function corresponds to a factorial code represented in a distributed fashion across the outputs of the feature detectors.
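The following is a minimal sketch of this adversarial scheme (our own simplified rendering in PyTorch, not Schmidhuber's original architecture; the full method also includes terms that push the detector outputs toward informative, near-binary codes):

import torch

torch.manual_seed(0)
n, d, m = 512, 8, 4                        # samples, input dimension, code units
half = torch.randint(0, 2, (n, d // 2)).float()
X = torch.cat([half, half], dim=1)         # redundant data: the second half duplicates the first

detector = torch.nn.Linear(d, m)           # m feature detectors with sigmoid outputs
predictors = torch.nn.ModuleList([torch.nn.Linear(m - 1, 1) for _ in range(m)])
opt_det = torch.optim.Adam(detector.parameters(), lr=1e-2)
opt_pred = torch.optim.Adam(predictors.parameters(), lr=1e-2)

def prediction_error(y):
    # mean squared error of each predictor guessing unit i from the other units
    err = 0.0
    for i in range(m):
        others = torch.cat([y[:, :i], y[:, i + 1:]], dim=1)
        err = err + ((predictors[i](others).squeeze(1) - y[:, i]) ** 2).mean()
    return err

for step in range(3000):
    # predictor phase: the predictors minimize their error on frozen codes
    y = torch.sigmoid(detector(X)).detach()
    opt_pred.zero_grad()
    prediction_error(y).backward()
    opt_pred.step()
    # detector phase: the detectors maximize the same error, i.e. minimize predictability
    y = torch.sigmoid(detector(X))
    opt_det.zero_grad()
    (-prediction_error(y)).backward()
    opt_det.step()

At the global optimum no predictor can extract information about its unit from the other units, which mirrors the independence property of a factorial code.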
Painsky, Rosset and Feder (2016, 2017) further studied this problem in the context of independent component analysis over finite alphabet sizes. Through a series of theorems they show that the factorial coding problem can be solved exactly with a branch and bound search tree algorithm, or tightly approximated with a series of linear problems. In addition, they introduce a simple transformation (namely, order permutation) which provides a greedy yet very effective approximation of the optimal solution. Practically, they show that with a careful implementation, the favorable properties of the order permutation can be achieved at asymptotically optimal computational complexity. Importantly, they provide theoretical guarantees, showing that while not every random vector can be efficiently decomposed into independent components, the majority of vectors decompose very well (that is, with a small constant cost) as the dimension increases. In addition, they demonstrate the use of factorial codes for data compression in multiple setups (2017).
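The order permutation can be sketched roughly as follows (our paraphrase of the idea; see the 2016 paper for the exact construction and its guarantees): relabel the alphabet so that the k-th most probable symbol receives the binary representation of k. Concentrating probability mass on codewords with many zero bits biases each marginal bit toward 0, which tends to reduce the sum of the bit entropies.

import numpy as np

def bit_entropy(p):
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def order_permutation_code(pmf, n_bits):
    # the k-th most probable symbol is mapped to the binary representation of k
    order = np.argsort(pmf)[::-1]
    code = np.zeros((len(pmf), n_bits), dtype=int)
    for rank, symbol in enumerate(order):
        code[symbol] = [(rank >> b) & 1 for b in range(n_bits - 1, -1, -1)]
    return code

rng = np.random.default_rng(0)
pmf = rng.dirichlet(np.ones(16))          # a random joint pmf over 16 symbols
code = order_permutation_code(pmf, 4)     # 4-bit binary codewords
marginals = pmf @ code                    # P(bit_j = 1) under the code
cost = bit_entropy(marginals).sum()       # sum of marginal bit entropies
bound = -(pmf * np.log2(pmf)).sum()       # joint entropy H(X), the lower bound
print(f"sum of bit entropies: {cost:.3f} >= H(X): {bound:.3f}")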
See also
* Factor analysis
* Unsupervised learning
* Image processing
* Signal processing
References
* H. B. Barlow, T. P. Kaushal, and G. J. Mitchison. Finding minimum entropy codes. Neural Computation, 1:412-423, 1989.
* J. Schmidhuber. Learning factorial codes by predictability minimization. Neural Computation, 4(6):863-879, 1992.
* J. Schmidhuber, M. Eldracher, and B. Foltin. Semilinear predictability minimization produces well-known feature detectors. Neural Computation, 8(4):773-786, 1996.
* A. Painsky, S. Rosset, and M. Feder. Generalized independent component analysis over finite alphabets. IEEE Transactions on Information Theory, 62(2):1038-1053, 2016.
* A. Painsky, S. Rosset, and M. Feder. Large alphabet source coding using independent component analysis. IEEE Transactions on Information Theory, 63(10):6514-6529, 2017.