In statistics, maximum likelihood estimation (MLE) is a method of
estimating
Estimation (or estimating) is the process of finding an estimate or approximation, which is a value that is usable for some purpose even if input data may be incomplete, uncertain, or unstable. The value is nonetheless usable because it is de ...
the
parameters
A parameter (), generally, is any characteristic that can help in defining or classifying a particular system (meaning an event, project, object, situation, etc.). That is, a parameter is an element of a system that is useful, or critical, when ...
of an assumed
probability distribution
In probability theory and statistics, a probability distribution is the mathematical function that gives the probabilities of occurrence of different possible outcomes for an experiment. It is a mathematical description of a random phenomeno ...
, given some observed data. This is achieved by
maximizing a
likelihood function
The likelihood function (often simply called the likelihood) represents the probability of random variable realizations conditional on particular values of the statistical parameters. Thus, when evaluated on a given sample, the likelihood funct ...
so that, under the assumed
statistical model
A statistical model is a mathematical model that embodies a set of statistical assumptions concerning the generation of sample data (and similar data from a larger population). A statistical model represents, often in considerably idealized form, ...
, the
observed data is most probable. The
point in the
parameter space The parameter space is the space of possible parameter values that define a particular mathematical model, often a subset of finite-dimensional Euclidean space. Often the parameters are inputs of a function, in which case the technical term for the ...
that maximizes the likelihood function is called the maximum likelihood estimate. The logic of maximum likelihood is both intuitive and flexible, and as such the method has become a dominant means of
statistical inference.
If the likelihood function is
differentiable, the
derivative test
In calculus, a derivative test uses the derivatives of a function to locate the critical points of a function and determine whether each point is a local maximum, a local minimum, or a saddle point. Derivative tests can also give information a ...
for finding maxima can be applied. In some cases, the first-order conditions of the likelihood function can be solved analytically; for instance, the
ordinary least squares
In statistics, ordinary least squares (OLS) is a type of linear least squares method for choosing the unknown parameters in a linear regression model (with fixed level-one effects of a linear function of a set of explanatory variables) by the ...
estimator for a
linear regression
In statistics, linear regression is a linear approach for modelling the relationship between a scalar response and one or more explanatory variables (also known as dependent and independent variables). The case of one explanatory variable is ...
model maximizes the likelihood when all observed outcomes are assumed to have
Normal distributions with the same variance.
From the perspective of
Bayesian inference, MLE is generally equivalent to
maximum a posteriori (MAP) estimation with
uniform
A uniform is a variety of clothing worn by members of an organization while participating in that organization's activity. Modern uniforms are most often worn by armed forces and paramilitary organizations such as police, emergency services, se ...
prior distributions (or a
normal prior distribution with a standard deviation of infinity). In
frequentist inference
Frequentist inference is a type of statistical inference based in frequentist probability, which treats “probability” in equivalent terms to “frequency” and draws conclusions from sample-data by means of emphasizing the frequency or pro ...
, MLE is a special case of an
extremum estimator In statistics and econometrics, extremum estimators are a wide class of estimators for parametric models that are calculated through maximization (or minimization) of a certain objective function, which depends on the data. The general theory of e ...
, with the objective function being the likelihood.
Principles
We model a set of observations as a random
sample from an unknown joint
probability distribution
In probability theory and statistics, a probability distribution is the mathematical function that gives the probabilities of occurrence of different possible outcomes for an experiment. It is a mathematical description of a random phenomeno ...
which is expressed in terms of a set of
parameters
A parameter (), generally, is any characteristic that can help in defining or classifying a particular system (meaning an event, project, object, situation, etc.). That is, a parameter is an element of a system that is useful, or critical, when ...
. The goal of maximum likelihood estimation is to determine the parameters for which the observed data have the highest joint probability. We write the parameters governing the joint distribution as a vector
so that this distribution falls within a
parametric family
In mathematics and its applications, a parametric family or a parameterized family is a family of objects (a set of related objects) whose differences depend only on the chosen values for a set of parameters.
Common examples are parametrized (fa ...
where
is called the ''
parameter space The parameter space is the space of possible parameter values that define a particular mathematical model, often a subset of finite-dimensional Euclidean space. Often the parameters are inputs of a function, in which case the technical term for the ...
'', a finite-dimensional subset of
Euclidean space
Euclidean space is the fundamental space of geometry, intended to represent physical space. Originally, that is, in Euclid's ''Elements'', it was the three-dimensional space of Euclidean geometry, but in modern mathematics there are Euclidean sp ...
. Evaluating the joint density at the observed data sample
gives a real-valued function,
:
which is called the
likelihood function
The likelihood function (often simply called the likelihood) represents the probability of random variable realizations conditional on particular values of the statistical parameters. Thus, when evaluated on a given sample, the likelihood funct ...
. For
independent and identically distributed random variables
In probability theory and statistics, a collection of random variables is independent and identically distributed if each random variable has the same probability distribution as the others and all are mutually independent. This property is usu ...
,
will be the product of univariate
density functions:
:
The goal of maximum likelihood estimation is to find the values of the model parameters that maximize the likelihood function over the parameter space,
that is
:
Intuitively, this selects the parameter values that make the observed data most probable. The specific value
that maximizes the likelihood function
is called the maximum likelihood estimate. Further, if the function
so defined is
measurable
In mathematics, the concept of a measure is a generalization and formalization of geometrical measures (length, area, volume) and other common notions, such as mass and probability of events. These seemingly distinct concepts have many simila ...
, then it is called the maximum likelihood
estimator
In statistics, an estimator is a rule for calculating an estimate of a given quantity based on observed data: thus the rule (the estimator), the quantity of interest (the estimand) and its result (the estimate) are distinguished. For example, the ...
. It is generally a function defined over the
sample space
In probability theory, the sample space (also called sample description space, possibility space, or outcome space) of an experiment or random trial is the set of all possible outcomes or results of that experiment. A sample space is usually de ...
, i.e. taking a given sample as its argument. A
sufficient but not necessary condition for its existence is for the likelihood function to be
continuous
Continuity or continuous may refer to:
Mathematics
* Continuity (mathematics), the opposing concept to discreteness; common examples include
** Continuous probability distribution or random variable in probability and statistics
** Continuous g ...
over a parameter space
that is
compact
Compact as used in politics may refer broadly to a pact or treaty; in more specific cases it may refer to:
* Interstate compact
* Blood compact, an ancient ritual of the Philippines
* Compact government, a type of colonial rule utilized in British ...
. For an
open
Open or OPEN may refer to:
Music
* Open (band), Australian pop/rock band
* The Open (band), English indie rock band
* ''Open'' (Blues Image album), 1969
* ''Open'' (Gotthard album), 1999
* ''Open'' (Cowboy Junkies album), 2001
* ''Open'' (Y ...
the likelihood function may increase without ever reaching a supremum value.
In practice, it is often convenient to work with the
natural logarithm
The natural logarithm of a number is its logarithm to the base of the mathematical constant , which is an irrational and transcendental number approximately equal to . The natural logarithm of is generally written as , , or sometimes, if ...
of the likelihood function, called the
log-likelihood
The likelihood function (often simply called the likelihood) represents the probability of random variable realizations conditional on particular values of the statistical parameters. Thus, when evaluated on a given sample, the likelihood funct ...
:
:
Since the logarithm is a
monotonic function
In mathematics, a monotonic function (or monotone function) is a function between ordered sets that preserves or reverses the given order. This concept first arose in calculus, and was later generalized to the more abstract setting of ord ...
, the maximum of
occurs at the same value of
as does the maximum of
If
is
differentiable in
the
necessary conditions for the occurrence of a maximum (or a minimum) are
:
known as the likelihood equations. For some models, these equations can be explicitly solved for
but in general no closed-form solution to the maximization problem is known or available, and an MLE can only be found via
numerical optimization
Mathematical optimization (alternatively spelled ''optimisation'') or mathematical programming is the selection of a best element, with regard to some criterion, from some set of available alternatives. It is generally divided into two subfi ...
. Another problem is that in finite samples, there may exist multiple
roots for the likelihood equations. Whether the identified root
of the likelihood equations is indeed a (local) maximum depends on whether the matrix of second-order partial and cross-partial derivatives, the so-called
Hessian matrix
In mathematics, the Hessian matrix or Hessian is a square matrix of second-order partial derivatives of a scalar-valued function, or scalar field. It describes the local curvature of a function of many variables. The Hessian matrix was developed ...
:
is
negative semi-definite at
, as this indicates local
concavity. Conveniently, most common
probability distribution
In probability theory and statistics, a probability distribution is the mathematical function that gives the probabilities of occurrence of different possible outcomes for an experiment. It is a mathematical description of a random phenomeno ...
s – in particular the
exponential family
In probability and statistics, an exponential family is a parametric set of probability distributions of a certain form, specified below. This special form is chosen for mathematical convenience, including the enabling of the user to calculate ...
– are
logarithmically concave.
Restricted parameter space
While the domain of the likelihood function—the
parameter space The parameter space is the space of possible parameter values that define a particular mathematical model, often a subset of finite-dimensional Euclidean space. Often the parameters are inputs of a function, in which case the technical term for the ...
—is generally a finite-dimensional subset of
Euclidean space
Euclidean space is the fundamental space of geometry, intended to represent physical space. Originally, that is, in Euclid's ''Elements'', it was the three-dimensional space of Euclidean geometry, but in modern mathematics there are Euclidean sp ...
, additional
restriction
Restriction, restrict or restrictor may refer to:
Science and technology
* restrict, a keyword in the C programming language used in pointer declarations
* Restriction enzyme, a type of enzyme that cleaves genetic material
Mathematics and log ...
s sometimes need to be incorporated into the estimation process. The parameter space can be expressed as
:
where
is a
vector-valued function mapping
into
Estimating the true parameter
belonging to
then, as a practical matter, means to find the maximum of the likelihood function subject to the
constraint
Constraint may refer to:
* Constraint (computer-aided design), a demarcation of geometrical characteristics between two or more entities or solid modeling bodies
* Constraint (mathematics), a condition of an optimization problem that the solution ...
Theoretically, the most natural approach to this
constrained optimization
In mathematical optimization, constrained optimization (in some contexts called constraint optimization) is the process of optimizing an objective function with respect to some variables in the presence of constraints on those variables. The obj ...
problem is the method of substitution, that is "filling out" the restrictions
to a set
in such a way that
is a
one-to-one function
In mathematics, an injective function (also known as injection, or one-to-one function) is a function that maps distinct elements of its domain to distinct elements; that is, implies . (Equivalently, implies in the equivalent contraposi ...
from
to itself, and reparameterize the likelihood function by setting
Because of the equivariance of the maximum likelihood estimator, the properties of the MLE apply to the restricted estimates also. For instance, in a
multivariate normal distribution
In probability theory and statistics, the multivariate normal distribution, multivariate Gaussian distribution, or joint normal distribution is a generalization of the one-dimensional ( univariate) normal distribution to higher dimensions. One ...
the
covariance matrix
In probability theory and statistics, a covariance matrix (also known as auto-covariance matrix, dispersion matrix, variance matrix, or variance–covariance matrix) is a square matrix giving the covariance between each pair of elements o ...
must be
positive-definite; this restriction can be imposed by replacing
where
is a real
upper triangular matrix
In mathematics, a triangular matrix is a special kind of square matrix. A square matrix is called if all the entries ''above'' the main diagonal are zero. Similarly, a square matrix is called if all the entries ''below'' the main diagonal ar ...
and
is its
transpose
In linear algebra, the transpose of a matrix is an operator which flips a matrix over its diagonal;
that is, it switches the row and column indices of the matrix by producing another matrix, often denoted by (among other notations).
The tr ...
.
In practice, restrictions are usually imposed using the method of Lagrange which, given the constraints as defined above, leads to the ''restricted likelihood equations''
:
and
where
is a column-vector of
Lagrange multiplier
In mathematical optimization, the method of Lagrange multipliers is a strategy for finding the local maxima and minima of a function subject to equality constraints (i.e., subject to the condition that one or more equations have to be satisfied ...
s and
is the
Jacobian matrix
In vector calculus, the Jacobian matrix (, ) of a vector-valued function of several variables is the matrix of all its first-order partial derivatives. When this matrix is square, that is, when the function takes the same number of variables ...
of partial derivatives.
Naturally, if the constraints are not binding at the maximum, the Lagrange multipliers should be zero. This in turn allows for a statistical test of the "validity" of the constraint, known as the
Lagrange multiplier test.
Properties
A maximum likelihood estimator is an
extremum estimator In statistics and econometrics, extremum estimators are a wide class of estimators for parametric models that are calculated through maximization (or minimization) of a certain objective function, which depends on the data. The general theory of e ...
obtained by maximizing, as a function of ''θ'', the
objective function
In mathematical optimization and decision theory, a loss function or cost function (sometimes also called an error function) is a function that maps an event or values of one or more variables onto a real number intuitively representing some "cos ...
. If the data are
independent and identically distributed
In probability theory and statistics, a collection of random variables is independent and identically distributed if each random variable has the same probability distribution as the others and all are mutually independent. This property is usua ...
, then we have
:
this being the sample analogue of the expected log-likelihood