The principle of maximum entropy states that the probability distribution which best represents the current state of knowledge about a system is the one with largest entropy, in the context of precisely stated prior data (such as a proposition that expresses testable information).
Another way of stating this: Take precisely stated prior data or testable information about a probability distribution function. Consider the set of all trial probability distributions that would encode the prior data. According to this principle, the distribution with maximal information entropy is the best choice.
History
The principle was first expounded by E. T. Jaynes in two papers in 1957, where he emphasized a natural correspondence between statistical mechanics and information theory. In particular, Jaynes defended the soundness of the Gibbsian method of statistical mechanics by arguing that the entropy of statistical mechanics and the information entropy of information theory are the same concept. Consequently, statistical mechanics should be considered a particular application of a general tool of logical inference and information theory.
Overview
In most practical cases, the stated prior data or testable information is given by a set of conserved quantities (average values of some moment functions) associated with the probability distribution in question. This is the way the maximum entropy principle is most often used in statistical thermodynamics. Another possibility is to prescribe some symmetries of the probability distribution. The equivalence between conserved quantities and corresponding symmetry groups implies a similar equivalence for these two ways of specifying the testable information in the maximum entropy method.
The maximum entropy principle is also needed to guarantee the uniqueness and consistency of probability assignments obtained by different methods, statistical mechanics and logical inference in particular.
The maximum entropy principle makes explicit our freedom in using different forms of prior data. As a special case, a uniform prior probability density (Laplace's principle of indifference, sometimes called the principle of insufficient reason) may be adopted. Thus, the maximum entropy principle is not merely an alternative way to view the usual methods of inference of classical statistics, but represents a significant conceptual generalization of those methods.
However, these statements do not imply that thermodynamical systems need not be shown to be ergodic to justify treatment as a statistical ensemble.
In ordinary language, the principle of maximum entropy can be said to express a claim of epistemic modesty, or of maximum ignorance. The selected distribution is the one that makes the least claim to being informed beyond the stated prior data, that is to say the one that admits the most ignorance beyond the stated prior data.
Testable information
The principle of maximum entropy is useful explicitly only when applied to ''testable information''. Testable information is a statement about a probability distribution whose truth or falsity is well-defined. For example, the statements
:the expectation of the variable <math>x</math> is 2.87
and
:<math>p_2 + p_3 > 0.6</math>
(where <math>p_2</math> and <math>p_3</math> are probabilities of events) are statements of testable information.
Given testable information, the maximum entropy procedure consists of seeking the probability distribution which maximizes information entropy, subject to the constraints of the information. This constrained optimization problem is typically solved using the method of Lagrange multipliers.
Entropy maximization with no testable information respects the universal "constraint" that the sum of the probabilities is one. Under this constraint, the maximum entropy discrete probability distribution is the uniform distribution,
:<math>p_i = \frac{1}{n} \quad \text{for all}\ i \in \{1, \ldots, n\}.</math>
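The claim that the uniform distribution has the largest entropy can be checked numerically on a small example. The sketch below is only an illustration: the helper name `shannon_entropy` and the three example distributions over four outcomes are assumptions made here, and entropy is measured in nats.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy in nats; outcomes with zero probability contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)

n = 4
uniform = [1.0 / n] * n          # no outcome is favoured
skewed = [0.7, 0.1, 0.1, 0.1]    # one outcome is favoured
peaked = [1.0, 0.0, 0.0, 0.0]    # the outcome is certain

h_uniform = shannon_entropy(uniform)
h_skewed = shannon_entropy(skewed)
h_peaked = shannon_entropy(peaked)

print(h_uniform, h_skewed, h_peaked)
```

Among these three candidates, all of which satisfy only the normalization constraint, the uniform one attains the upper bound log ''n'' and the certain outcome attains zero.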
Applications
The principle of maximum entropy is commonly applied in two ways to inferential problems:
Prior probabilities
The principle of maximum entropy is often used to obtain prior probability distributions for Bayesian inference. Jaynes was a strong advocate of this approach, claiming the maximum entropy distribution represented the least informative distribution.
A large amount of literature is now dedicated to the elicitation of maximum entropy priors and links with channel coding.
Posterior probabilities
Maximum entropy is a sufficient updating rule for radical probabilism. Richard Jeffrey's probability kinematics is a special case of maximum entropy inference. However, maximum entropy is not a generalisation of all such sufficient updating rules.
Maximum entropy models
Alternatively, the principle is often invoked for model specification: in this case the observed data itself is assumed to be the testable information. Such models are widely used in natural language processing. An example of such a model is logistic regression, which corresponds to the maximum entropy classifier for independent observations.
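The correspondence with logistic regression can be illustrated for two classes. The sketch below assumes the usual conditional maximum entropy form, in which the probability of class ''y'' is proportional to the exponential of a weighted sum of features; the function names, weights and feature values are hypothetical and chosen only for illustration.

```python
import math

def maxent_class_probs(weights, features):
    """Conditional maxent model: P(y | x) is proportional to exp(w_y . x)."""
    scores = [sum(w * f for w, f in zip(w_y, features)) for w_y in weights]
    top = max(scores)  # subtract the largest score for numerical stability
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)      # normalization constant (partition function)
    return [e / z for e in exps]

def logistic(t):
    """Standard logistic (sigmoid) function."""
    return 1.0 / (1.0 + math.exp(-t))

# Hypothetical weights for class 0 and class 1, and one feature vector.
weights = [[0.2, -0.5, 1.0], [1.1, 0.3, -0.4]]
features = [1.0, 2.0, 0.5]

probs = maxent_class_probs(weights, features)

# Logistic regression on the difference of the two weight vectors.
diff = [w1 - w0 for w0, w1 in zip(weights[0], weights[1])]
p1_logistic = logistic(sum(d * f for d, f in zip(diff, features)))

print(probs[1], p1_logistic)
```

With two classes, the normalized exponential reduces to a logistic function of the difference between the two weight vectors, so both printed values agree up to rounding.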
Probability density estimation
One of the main applications of the maximum entropy principle is in discrete and continuous density estimation. Similar to support vector machine estimators, the maximum entropy principle may require the solution to a quadratic programming problem, and thus provide a sparse mixture model as the optimal density estimator. One important advantage of the method is its ability to incorporate prior information in the density estimation.
General solution for the maximum entropy distribution with linear constraints
Discrete case
We have some testable information ''I'' about a quantity ''x'' taking values in <math>\{x_1, x_2, \ldots, x_n\}</math>. We assume this information has the form of ''m'' constraints on the expectations of the functions <math>f_k</math>; that is, we require our probability distribution to satisfy the moment inequality/equality constraints:
:<math>\sum_{i=1}^n \Pr(x_i) f_k(x_i) \geq F_k, \qquad k = 1, \ldots, m,</math>
where the <math>F_k</math> are observables. We also require the probabilities to sum to one, which may be viewed as a primitive constraint on the constant function equal to 1, with an observable equal to 1, giving the constraint
:<math>\sum_{i=1}^n \Pr(x_i) = 1.</math>
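The probability distribution with maximum information entropy subject to these inequality/equality constraints is of the form:
:<math>\Pr(x_i) = \frac{1}{Z(\lambda_1, \ldots, \lambda_m)} \exp\left[\lambda_1 f_1(x_i) + \cdots + \lambda_m f_m(x_i)\right],</math>
for some <math>\lambda_1, \ldots, \lambda_m</math>. It is sometimes called the Gibbs distribution. The normalization constant is determined by:
:<math>Z(\lambda_1, \ldots, \lambda_m) = \sum_{i=1}^n \exp\left[\lambda_1 f_1(x_i) + \cdots + \lambda_m f_m(x_i)\right],</math>
and is conventionally called the partition function. (The Pitman–Koopman theorem states that the necessary and sufficient condition for a sampling distribution to admit sufficient statistics of bounded dimension is that it have the general form of a maximum entropy distribution.)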
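The exponential (Gibbs) form of the solution can be motivated by a short Lagrange-multiplier calculation for the equality-constrained case. The following is a sketch only; the shorthand <math>p_i = \Pr(x_i)</math>, the Lagrangian <math>\mathcal{L}</math> and the normalization multiplier <math>\lambda_0</math> are notation introduced here for illustration.

```latex
\begin{align}
\mathcal{L} &= -\sum_{i=1}^n p_i \log p_i
  + \lambda_0 \Big( \sum_{i=1}^n p_i - 1 \Big)
  + \sum_{k=1}^m \lambda_k \Big( \sum_{i=1}^n p_i f_k(x_i) - F_k \Big), \\
\frac{\partial \mathcal{L}}{\partial p_i}
  &= -\log p_i - 1 + \lambda_0 + \sum_{k=1}^m \lambda_k f_k(x_i) = 0, \\
p_i &= e^{\lambda_0 - 1} \exp\Big[ \sum_{k=1}^m \lambda_k f_k(x_i) \Big]
     = \frac{1}{Z} \exp\Big[ \sum_{k=1}^m \lambda_k f_k(x_i) \Big],
\qquad Z = e^{1 - \lambda_0}.
\end{align}
```

Imposing <math>\sum_i p_i = 1</math> on the last line fixes <math>Z</math> as the sum of the exponential factors over all <math>x_i</math>, so <math>\lambda_0</math> is eliminated and only the ''m'' multipliers attached to the moment constraints remain to be determined.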
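The <math>\lambda_k</math> parameters are Lagrange multipliers. In the case of equality constraints their values are determined from the solution of the nonlinear equations
:<math>F_k = \frac{\partial}{\partial \lambda_k} \log Z(\lambda_1, \ldots, \lambda_m).</math>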
In the case of inequality constraints, the Lagrange multipliers are determined from the solution of a convex optimization program with linear constraints. In both cases, there is in general no closed-form solution, and the computation of the Lagrange multipliers usually requires numerical methods.
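As an illustration of the numerical step, consider a single equality constraint on the mean of a six-sided die. The sketch below assumes the exponential form in which the probability of face ''x'' is proportional to exp(λ''x''), and finds λ by bisection. The target mean of 4.5, the function names and the search bracket are assumptions chosen for this example; bisection is one simple choice among many root-finding methods.

```python
import math

def maxent_distribution(values, lam):
    """Probabilities proportional to exp(lam * value), normalized by Z."""
    weights = [math.exp(lam * v) for v in values]
    z = sum(weights)  # partition function Z(lam)
    return [w / z for w in weights]

def mean_under(values, lam):
    """Expected value of the quantity under the maxent distribution."""
    probs = maxent_distribution(values, lam)
    return sum(p * v for p, v in zip(probs, values))

def solve_multiplier(values, target_mean, lo=-50.0, hi=50.0, tol=1e-12):
    """Bisection on lam; the mean is increasing in lam (its derivative is the variance)."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mean_under(values, mid) < target_mean:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

faces = [1, 2, 3, 4, 5, 6]
lam = solve_multiplier(faces, 4.5)
probs = maxent_distribution(faces, lam)

print(lam)
print(probs)
```

Because the required mean exceeds the unconstrained mean of 3.5, the multiplier comes out positive and the probabilities increase with the face value.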
Continuous case
For continuous distributions, the Shannon entropy cannot be used, as it is only defined for discrete probability spaces. Instead, Edwin Jaynes (1963, 1968, 2003) gave the following formula, which is closely related to the relative entropy (see also differential entropy).
:<math>H_c = -\int p(x) \log\frac{p(x)}{q(x)}\,dx</math>
where ''q''(''x''), which Jaynes called the "invariant measure", is proportional to the limiting density of discrete points. For now, we shall assume that ''q'' is known; we will discuss it further after the solution equations are given.
A closely related quantity, the relative entropy, is usually defined as the Kullback–Leibler divergence of ''p'' from ''q'' (although it is sometimes, confusingly, defined as the negative of this). The inference principle of minimizing this, due to Kullback, is known as the Principle of Minimum Discrimination Information.
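The link between minimizing relative entropy and maximizing entropy can be seen in a discrete analogue. The sketch below assumes a uniform reference distribution ''q'' over ''n'' outcomes with all entries positive; the function names and the example distribution ''p'' are illustrative choices, and quantities are in nats.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) = sum_i p_i * log(p_i / q_i); assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def shannon_entropy(p):
    """Shannon entropy in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

n = 4
q = [1.0 / n] * n                  # uniform reference distribution
p = [0.5, 0.25, 0.125, 0.125]      # example distribution

d = kl_divergence(p, q)
h = shannon_entropy(p)

# With uniform q, the divergence equals log(n) minus the entropy of p.
print(d, math.log(n) - h)
```

With a uniform reference, the divergence differs from the negative entropy only by the constant log ''n'', so minimizing the divergence selects the same distribution as maximizing the entropy.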
We have some testable information ''I'' about a quantity ''x'' which takes values in some interval of the real numbers (all integrals below are over this interval). We assume this information has the form of ''m'' constraints on the expectations of the functions <math>f_k</math>, i.e. we require our probability density function to satisfy the inequality (or purely equality) moment constraints:
:<math>\int p(x) f_k(x)\,dx \geq F_k, \qquad k = 1, \ldots, m,</math>
where the <math>F_k</math> are observables. We also require the probability density to integrate to one, which may be viewed as a primitive constraint on the constant function equal to 1, with an observable equal to 1, giving the constraint
:<math>\int p(x)\,dx = 1.</math>
The probability density function with maximum <math>H_c</math> subject to these constraints is:
:<math>p(x) = \frac{1}{Z(\lambda_1, \ldots, \lambda_m)} q(x) \exp\left[\lambda_1 f_1(x) + \cdots + \lambda_m f_m(x)\right]</math>