In
statistics
Statistics (from German language, German: ', "description of a State (polity), state, a country") is the discipline that concerns the collection, organization, analysis, interpretation, and presentation of data. In applying statistics to a s ...
, maximum likelihood estimation (MLE) is a method of
estimating the
parameters of an assumed
probability distribution
In probability theory and statistics, a probability distribution is a Function (mathematics), function that gives the probabilities of occurrence of possible events for an Experiment (probability theory), experiment. It is a mathematical descri ...
, given some observed data. This is achieved by
maximizing a
likelihood function
A likelihood function (often simply called the likelihood) measures how well a statistical model explains observed data by calculating the probability of seeing that data under different parameter values of the model. It is constructed from the ...
so that, under the assumed
statistical model
A statistical model is a mathematical model that embodies a set of statistical assumptions concerning the generation of Sample (statistics), sample data (and similar data from a larger Statistical population, population). A statistical model repre ...
, the
observed data is most probable. The
point in the
parameter space that maximizes the likelihood function is called the maximum likelihood estimate. The logic of maximum likelihood is both intuitive and flexible, and as such the method has become a dominant means of
statistical inference
Statistical inference is the process of using data analysis to infer properties of an underlying probability distribution.Upton, G., Cook, I. (2008) ''Oxford Dictionary of Statistics'', OUP. . Inferential statistical analysis infers properties of ...
.
If the likelihood function is
differentiable, the
derivative test for finding maxima can be applied. In some cases, the first-order conditions of the likelihood function can be solved analytically; for instance, the
ordinary least squares estimator for a
linear regression
In statistics, linear regression is a statistical model, model that estimates the relationship between a Scalar (mathematics), scalar response (dependent variable) and one or more explanatory variables (regressor or independent variable). A mode ...
model maximizes the likelihood when the random errors are assumed to have
normal distributions with the same variance.
From the perspective of
Bayesian inference, MLE is generally equivalent to
maximum a posteriori (MAP) estimation with a
prior distribution that is
uniform
A uniform is a variety of costume worn by members of an organization while usually participating in that organization's activity. Modern uniforms are most often worn by armed forces and paramilitary organizations such as police, emergency serv ...
in the region of interest. In
frequentist inference
Frequentist inference is a type of statistical inference based in frequentist probability, which treats “probability” in equivalent terms to “frequency” and draws conclusions from sample-data by means of emphasizing the frequency or pr ...
, MLE is a special case of an
extremum estimator, with the objective function being the likelihood.
Principles
We model a set of observations as a random
sample from an unknown
joint probability distribution
A joint or articulation (or articular surface) is the connection made between bones, ossicles, or other hard structures in the body which link an animal's skeletal system into a functional whole.Saladin, Ken. Anatomy & Physiology. 7th ed. McGraw- ...
which is expressed in terms of a set of
parameters. The goal of maximum likelihood estimation is to determine the parameters for which the observed data have the highest joint probability. We write the parameters governing the joint distribution as a vector
so that this distribution falls within a
parametric family where
is called the ''
parameter space'', a finite-dimensional subset of
Euclidean space
Euclidean space is the fundamental space of geometry, intended to represent physical space. Originally, in Euclid's ''Elements'', it was the three-dimensional space of Euclidean geometry, but in modern mathematics there are ''Euclidean spaces ...
. Evaluating the joint density at the observed data sample
gives a real-valued function,
which is called the
likelihood function
A likelihood function (often simply called the likelihood) measures how well a statistical model explains observed data by calculating the probability of seeing that data under different parameter values of the model. It is constructed from the ...
. For
independent random variables,
will be the product of univariate
density functions:
The goal of maximum likelihood estimation is to find the values of the model parameters that maximize the likelihood function over the parameter space,
that is:
Intuitively, this selects the parameter values that make the observed data most probable. The specific value
that maximizes the likelihood function
is called the maximum likelihood estimate. Further, if the function
so defined is
measurable, then it is called the maximum likelihood
estimator. It is generally a function defined over the
sample space, i.e. taking a given sample as its argument. A
sufficient but not necessary condition for its existence is for the likelihood function to be
continuous over a parameter space
that is
compact
Compact as used in politics may refer broadly to a pact or treaty; in more specific cases it may refer to:
* Interstate compact, a type of agreement used by U.S. states
* Blood compact, an ancient ritual of the Philippines
* Compact government, a t ...
. For an
open
Open or OPEN may refer to:
Music
* Open (band), Australian pop/rock band
* The Open (band), English indie rock band
* ''Open'' (Blues Image album), 1969
* ''Open'' (Gerd Dudek, Buschi Niebergall, and Edward Vesala album), 1979
* ''Open'' (Go ...
the likelihood function may increase without ever reaching a supremum value.
In practice, it is often convenient to work with the
natural logarithm
The natural logarithm of a number is its logarithm to the base of a logarithm, base of the e (mathematical constant), mathematical constant , which is an Irrational number, irrational and Transcendental number, transcendental number approxima ...
of the likelihood function, called the
log-likelihood:
Since the logarithm is a
monotonic function
In mathematics, a monotonic function (or monotone function) is a function between ordered sets that preserves or reverses the given order. This concept first arose in calculus, and was later generalized to the more abstract setting of or ...
, the maximum of
occurs at the same value of
as does the maximum of
If
is
differentiable in
sufficient conditions for the occurrence of a maximum (or a minimum) are
known as the likelihood equations. For some models, these equations can be explicitly solved for
but in general no closed-form solution to the maximization problem is known or available, and an MLE can only be found via
numerical optimization. Another problem is that in finite samples, there may exist multiple
roots
A root is the part of a plant, generally underground, that anchors the plant body, and absorbs and stores water and nutrients.
Root or roots may also refer to:
Art, entertainment, and media
* ''The Root'' (magazine), an online magazine focusin ...
for the likelihood equations. Whether the identified root
of the likelihood equations is indeed a (local) maximum depends on whether the matrix of second-order partial and cross-partial derivatives, the so-called
Hessian matrix
is
negative semi-definite at
, as this indicates local
concavity. Conveniently, most common
probability distribution
In probability theory and statistics, a probability distribution is a Function (mathematics), function that gives the probabilities of occurrence of possible events for an Experiment (probability theory), experiment. It is a mathematical descri ...
s – in particular the
exponential family – are
logarithmically concave.
Restricted parameter space
While the domain of the likelihood function—the
parameter space—is generally a finite-dimensional subset of
Euclidean space
Euclidean space is the fundamental space of geometry, intended to represent physical space. Originally, in Euclid's ''Elements'', it was the three-dimensional space of Euclidean geometry, but in modern mathematics there are ''Euclidean spaces ...
, additional
restrictions sometimes need to be incorporated into the estimation process. The parameter space can be expressed as
where
is a
vector-valued function
A vector-valued function, also referred to as a vector function, is a mathematical function of one or more variables whose range is a set of multidimensional vectors or infinite-dimensional vectors. The input of a vector-valued function could ...
mapping
into
Estimating the true parameter
belonging to
then, as a practical matter, means to find the maximum of the likelihood function subject to the
constraint
Theoretically, the most natural approach to this
constrained optimization
In mathematical optimization, constrained optimization (in some contexts called constraint optimization) is the process of optimizing an objective function with respect to some variables in the presence of constraints on those variables. The obj ...
problem is the method of substitution, that is "filling out" the restrictions
to a set
in such a way that
is a
one-to-one function
In mathematics, an injective function (also known as injection, or one-to-one function ) is a function that maps distinct elements of its domain to distinct elements of its codomain; that is, implies (equivalently by contraposition, impl ...
from
to itself, and reparameterize the likelihood function by setting
Because of the equivariance of the maximum likelihood estimator, the properties of the MLE apply to the restricted estimates also. For instance, in a
multivariate normal distribution the
covariance matrix must be
positive-definite; this restriction can be imposed by replacing
where
is a real
upper triangular matrix and
is its
transpose
In linear algebra, the transpose of a Matrix (mathematics), matrix is an operator which flips a matrix over its diagonal;
that is, it switches the row and column indices of the matrix by producing another matrix, often denoted by (among other ...
.
In practice, restrictions are usually imposed using the method of Lagrange which, given the constraints as defined above, leads to the ''restricted likelihood equations''
and
where
is a column-vector of
Lagrange multiplier
In mathematical optimization, the method of Lagrange multipliers is a strategy for finding the local maxima and minima of a function (mathematics), function subject to constraint (mathematics), equation constraints (i.e., subject to the conditio ...
s and
is the
Jacobian matrix
In vector calculus, the Jacobian matrix (, ) of a vector-valued function of several variables is the matrix of all its first-order partial derivatives. If this matrix is square, that is, if the number of variables equals the number of component ...
of partial derivatives.
Naturally, if the constraints are not binding at the maximum, the Lagrange multipliers should be zero. This in turn allows for a statistical test of the "validity" of the constraint, known as the
Lagrange multiplier test.
Nonparametric maximum likelihood estimation
Nonparametric maximum likelihood estimation can be performed using the
empirical likelihood.
Properties
A maximum likelihood estimator is an
extremum estimator obtained by maximizing, as a function of ''θ'', the
objective function
In mathematical optimization and decision theory, a loss function or cost function (sometimes also called an error function) is a function that maps an event or values of one or more variables onto a real number intuitively representing some "cost ...
. If the data are
independent and identically distributed, then we have
this being the sample analogue of the expected log-likelihood