Algorithmic inference gathers new developments in the statistical inference methods made feasible by the powerful computing devices widely available to any data analyst. Cornerstones in this field are computational learning theory, granular computing, bioinformatics, and, long ago, structural probability.
The main focus is on the algorithms which compute statistics rooting the study of a random phenomenon, along with the amount of data they must feed on to produce reliable results. This shifts the interest of mathematicians from the study of the distribution laws to the functional properties of the statistics, and the interest of computer scientists from the algorithms for processing data to the information they process.
The Fisher parametric inference problem
Concerning the identification of the parameters of a distribution law, the mature reader may recall lengthy disputes in the mid 20th century about the interpretation of their variability in terms of fiducial distribution, structural probabilities, priors/posteriors, and so on. From an epistemological viewpoint, this entailed a companion dispute as to the nature of probability: is it a physical feature of phenomena to be described through random variables, or a way of synthesizing data about a phenomenon? Opting for the latter, Fisher defines a ''fiducial distribution'' law of the parameters of a given random variable that he deduces from a sample of its specifications. With this law he computes, for instance, "the probability that μ (mean of a Gaussian variable – our note) is less than any assigned value, or the probability that it lies between any assigned values, or, in short, its probability distribution, in the light of the sample observed".
The classic solution
Fisher fought hard to defend the difference and superiority of his notion of parameter distribution in comparison to analogous notions, such as Bayes' posterior distribution, Fraser's constructive probability and Neyman's confidence intervals. For half a century, Neyman's confidence intervals won out for all practical purposes, crediting the phenomenological nature of probability. With this perspective, when you deal with a Gaussian variable, its mean μ is fixed by the physical features of the phenomenon you are observing, where the observations are random operators, hence the observed values are specifications of a random sample. Because of their randomness, you may compute from the sample specific intervals containing the fixed μ with a given probability that you denote ''confidence''.
Example
Let ''X'' be a Gaussian variable [By default, capital letters (such as ''U'', ''X'') will denote random variables and small letters (''u'', ''x'') their corresponding specifications.] with parameters \mu and \sigma^2 and \{X_1, \ldots, X_m\} a sample drawn from it. Working with statistics
: s_\mu = \sum_{i=1}^m x_i
and
: s_{\sigma^2} = \sum_{i=1}^m (x_i - \bar{x})^2, \quad \text{where } \bar{x} = \frac{s_\mu}{m}
is the sample mean, we recognize that
: T = \frac{(\bar{x} - \mu)\sqrt{m(m-1)}}{\sqrt{s_{\sigma^2}}}
follows a Student's t distribution with parameter (degrees of freedom) ''m'' − 1, so that
: f_T(t) \propto \left(1 + \frac{t^2}{m-1}\right)^{-m/2}.
Gauging ''T'' between two quantiles and inverting its expression as a function of \mu, you obtain confidence intervals for \mu.
With the sample specification
: x = \{x_1, \ldots, x_{10}\}
having size ''m'' = 10, you compute the statistics s_\mu and s_{\sigma^2}, and obtain a 0.90 confidence interval for \mu with extremes (3.03, 5.65).
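As a minimal sketch of this computation, the snippet below builds the same kind of t-based 0.90 confidence interval; the sample values are illustrative stand-ins, not the data behind the interval quoted above.

<syntaxhighlight lang="python">
# Minimal sketch: t-based confidence interval for the mean of a Gaussian
# variable. The sample values are illustrative, not the article's data.
import numpy as np
from scipy import stats

x = np.array([5.2, 3.1, 4.8, 6.0, 2.7, 4.4, 5.9, 3.6, 4.1, 5.0])  # assumed sample
m = len(x)

s_mu = x.sum()                        # statistic s_mu
x_bar = s_mu / m                      # sample mean
s_sigma2 = ((x - x_bar) ** 2).sum()   # statistic s_sigma^2

delta = 0.90                          # confidence level
t_q = stats.t.ppf((1 + delta) / 2, df=m - 1)   # symmetric Student t quantile

# Invert T = (x_bar - mu) * sqrt(m (m - 1)) / sqrt(s_sigma2) between -t_q and t_q.
half_width = t_q * np.sqrt(s_sigma2 / (m * (m - 1)))
print("0.90 confidence interval for mu:", (x_bar - half_width, x_bar + half_width))
</syntaxhighlight>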
Inferring functions with the help of a computer
From a modeling perspective the entire dispute looks like a chicken-and-egg dilemma: either the data are fixed first and the probability distribution of their properties follows as a consequence, or the properties are fixed first and the probability distribution of the observed data follows as a corollary.
The classic solution has one benefit and one drawback. The former was appreciated particularly back when people still did computations with paper and pencil. Per se, the task of computing a Neyman confidence interval for a fixed parameter θ is hard: you do not know θ, but you seek to place around it an interval with a possibly very low probability of failing. An analytical solution is available only for a very limited number of theoretical cases. ''Vice versa'', a large variety of instances may be quickly solved in an ''approximate way'' via the central limit theorem, in terms of a confidence interval around a Gaussian distribution – that's the benefit.
The drawback is that the central limit theorem is applicable only when the sample size is sufficiently large. Therefore, it is less and less applicable to the samples involved in modern inference instances. The fault is not in the sample size itself. Rather, this size is not sufficiently large because of the complexity of the inference problem.
With the availability of large computing facilities, scientists refocused from isolated parameter inference to complex function inference, i.e. to sets of highly nested parameters identifying functions. In these cases we speak about ''learning of functions'' (in terms, for instance, of regression, neuro-fuzzy systems or computational learning) on the basis of highly informative samples. A first effect of having a complex structure linking data is the reduction of the number of sample degrees of freedom, i.e. the burning of a part of the sample points, so that the effective sample size to be considered in the central limit theorem is too small. Focusing on the sample size ensuring a limited learning error with a given confidence level, the consequence is that the lower bound on this size grows with complexity indices such as the VC dimension or the detail of the class to which the function we want to learn belongs.
Example
A sample of 1,000 independent bits is enough to ensure an absolute error of at most 0.081 on the estimation of the parameter ''p'' of the underlying Bernoulli variable with a confidence of at least 0.99. The same size cannot guarantee a threshold less than 0.088 with the same confidence 0.99 when the error is identified with the probability that a 20-year-old man living in New York does not fit the ranges of height, weight and waistline observed on 1,000 Big Apple inhabitants. The accuracy shortage occurs because both the VC dimension and the detail of the class of parallelepipeds, to which the one observed from the 1,000 inhabitants' ranges belongs, are equal to 6.
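The figures above come from the bounds developed within the algorithmic inference framework; as a rough point of comparison only, the sketch below evaluates the standard distribution-free Hoeffding bound for the Bernoulli case, which is an assumption of this illustration rather than the computation behind the numbers quoted above.

<syntaxhighlight lang="python">
# Rough comparison only: the standard Hoeffding bound for estimating a
# Bernoulli parameter p from m independent bits. This is NOT the bound behind
# the 0.081 / 0.088 figures in the text, just a familiar baseline.
import math

m = 1000          # sample size
delta = 0.01      # 1 - confidence

# P(|p_hat - p| > eps) <= 2 exp(-2 m eps^2) = delta  =>  solve for eps.
eps = math.sqrt(math.log(2 / delta) / (2 * m))
print(f"Hoeffding error threshold with confidence {1 - delta}: {eps:.3f}")
</syntaxhighlight>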
The general inversion problem solving the Fisher question
With insufficiently large samples, the approach ''fixed sample – random properties'' suggests inference procedures in three steps, combining a sampling mechanism for the variable, master equations rooted on well-behaved statistics, and the resulting population of parameter values compatible with the observed sample.
Definition
For a random variable ''X'' and a sample drawn from it, a ''compatible distribution'' is a distribution having the same sampling mechanism as ''X'', with a value of the random parameter Θ derived from a master equation rooted on a well-behaved statistic ''s''.
Example
You may find the distribution law of the Pareto parameters ''A'' and ''K'' as an implementation example of the population bootstrap method, as in the figure on the left.
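A minimal sketch of how such a population bootstrap could be implemented is given below, assuming the Pareto sampling mechanism x = k u^{-1/a} with uniform seed u, shape ''A'' = a and scale ''K'' = k, and master equations rooted on the statistics s_1 = \sum_i \log x_i and s_2 = \min_i x_i; the sample and the choice of statistics are illustrative assumptions, not the article's own setup.

<syntaxhighlight lang="python">
# Sketch of a population bootstrap for the Pareto parameters A (shape) and K
# (scale), under the assumed sampling mechanism x = k * u**(-1/a), u ~ U(0,1).
# Master equations (for a seed draw u_1..u_m and observed statistics s1, s2):
#   s1 = sum(log x_i) = m*log(k) - (1/a) * sum(log u_i)
#   s2 = min(x_i)     = k * max(u_i)**(-1/a)
# Each seed draw is solved for (a, k), yielding one compatible parameter pair.
import numpy as np

rng = np.random.default_rng(42)

# Illustrative observed sample (classical Pareto with k = 1, a = 3).
x = rng.pareto(3.0, size=50) + 1.0
s1, s2 = np.log(x).sum(), x.min()           # well-behaved statistics
m = len(x)

boots = []
for _ in range(5000):
    u = rng.uniform(size=m)                 # fresh seeds
    S, L = np.log(u).sum(), np.log(u.max())
    inv_a = (s1 - m * np.log(s2)) / (m * L - S)   # solve the two master equations
    a = 1.0 / inv_a
    k = s2 * u.max() ** inv_a
    boots.append((a, k))

boots = np.array(boots)
print("median compatible (A, K):", np.median(boots, axis=0))
</syntaxhighlight>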
Implementing the twisting argument method, you get the distribution law F_M of the mean ''M'' of a Gaussian variable ''X'' on the basis of the statistic s_\mu = \sum_{i=1}^m x_i when \sigma^2 is known to be equal to a fixed value. Its expression is:
: F_M(\mu) = \Phi\left(\frac{\mu - s_\mu/m}{\sigma/\sqrt{m}}\right),
shown in the figure on the right, where \Phi is the cumulative distribution function of a standard normal distribution.
Computing a confidence interval for ''M'' given its distribution function is straightforward: we need only find two quantiles (for instance the (1 − δ)/2 and (1 + δ)/2 quantiles, in case we are interested in a confidence interval of level δ symmetric in the tails' probabilities), as indicated on the left in the diagram showing the behavior of the two bounds for different values of the statistic s_\mu.
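A minimal sketch of this quantile computation, assuming an observed value of s_\mu, a known \sigma and a symmetric confidence level \delta, could look as follows.

<syntaxhighlight lang="python">
# Sketch: confidence interval for the mean M of a Gaussian variable with known
# sigma, from the compatible distribution F_M = Normal(s_mu / m, sigma^2 / m).
# The observed statistic, sigma and delta below are illustrative assumptions.
from scipy import stats

m, sigma = 25, 2.0        # sample size and known standard deviation
s_mu = 110.0              # assumed observed value of the statistic sum(x_i)
delta = 0.90              # confidence level

F_M = stats.norm(loc=s_mu / m, scale=sigma / m ** 0.5)   # distribution of M
lower = F_M.ppf((1 - delta) / 2)   # (1 - delta)/2 quantile
upper = F_M.ppf((1 + delta) / 2)   # (1 + delta)/2 quantile
print(f"{delta:.0%} confidence interval for M: ({lower:.2f}, {upper:.2f})")
</syntaxhighlight>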
The Achilles heel of Fisher's approach lies in the joint distribution of more than one parameter, say the mean and variance of a Gaussian distribution. On the contrary, with the last approach (and the above-mentioned methods: population bootstrap and twisting argument) we may learn the joint distribution of many parameters. For instance, focusing on the distribution of two or more parameters, in the figures below we report two confidence regions where the function to be learnt falls with a confidence of 90%. The former concerns the probability with which an extended support vector machine attributes a binary label 1 to the points of the plane. The two surfaces are drawn on the basis of a set of sample points, in turn labelled according to a specific distribution law. The latter concerns the confidence region of the hazard rate of breast cancer recurrence computed from a censored sample.
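As an illustration of learning a joint parameter distribution, the sketch below applies a population bootstrap to the pair (mean, variance) of a Gaussian variable, using the sampling mechanism x = \mu + \sigma z with standard normal seeds and master equations rooted on the statistics s_\mu and s_{\sigma^2}; the sample and the summary reported are illustrative assumptions, not the confidence regions shown in the figures.

<syntaxhighlight lang="python">
# Sketch: joint compatible distribution of the Gaussian parameters (mu, sigma^2)
# via population bootstrap. Sampling mechanism: x = mu + sigma * z, z ~ N(0, 1).
# Master equations for observed s_mu = sum(x_i), s_sigma2 = sum((x_i - mean)^2):
#   s_mu     = m * mu + sigma * sum(z_i)
#   s_sigma2 = sigma^2 * sum((z_i - mean(z))^2)
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(4.0, 1.5, size=30)                 # illustrative observed sample
m = len(x)
s_mu, s_sigma2 = x.sum(), ((x - x.mean()) ** 2).sum()

pairs = []
for _ in range(5000):
    z = rng.normal(size=m)                        # fresh seeds
    sigma = np.sqrt(s_sigma2 / ((z - z.mean()) ** 2).sum())
    mu = (s_mu - sigma * z.sum()) / m
    pairs.append((mu, sigma ** 2))

pairs = np.array(pairs)
# A 90% joint confidence region could be drawn from these replicas; here we
# just report marginal 5%-95% ranges as a crude summary.
print("mu range:", np.quantile(pairs[:, 0], [0.05, 0.95]))
print("sigma^2 range:", np.quantile(pairs[:, 1], [0.05, 0.95]))
</syntaxhighlight>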
Notes
References
*{{Citation
 | last=Wilks
 | first=S.S.
 | title=Mathematical Statistics
 | series=Wiley Publications in Statistics
 | publisher=John Wiley
 | location=New York
 | year=1962
}}