In statistics, M-estimators are a broad class of extremum estimators for which the objective function is a sample average. Both non-linear least squares and maximum likelihood estimation are special cases of M-estimators. The definition of M-estimators was motivated by robust statistics, which contributed new types of M-estimators. However, M-estimators are not inherently robust, as is clear from the fact that they include maximum likelihood estimators, which are in general not robust. The statistical procedure of evaluating an M-estimator on a data set is called M-estimation. The "M" initial stands for "maximum likelihood-type".
More generally, an M-estimator may be defined to be a zero of an estimating function. This estimating function is often the derivative of another statistical function. For example, a maximum-likelihood estimate is the point where the derivative of the likelihood function with respect to the parameter is zero; thus, a maximum-likelihood estimator is a critical point of the score function. In many applications, such M-estimators can be thought of as estimating characteristics of the population.
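Concretely, writing
: s(x, \theta) = \frac{\partial \log f(x, \theta)}{\partial \theta}
for the score of a density ''f'' parameterized by ''θ'', the maximum-likelihood estimator is a zero of the estimating equation
: \sum_{i=1}^n s(x_i, \hat\theta) = 0.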
Historical motivation
The method of least squares is a prototypical M-estimator, since the estimator is defined as a minimum of the sum of squares of the residuals.
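For instance, in a linear model with responses y_i, regressors x_i and residuals r_i(\beta) = y_i - x_i^{\mathsf{T}} \beta, least squares is the M-estimator whose ρ is the squared residual:
: \hat\beta = \arg\min_{\beta} \sum_{i=1}^n r_i(\beta)^2.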
Another popular M-estimator is maximum-likelihood estimation. For a family of probability density functions ''f'' parameterized by ''θ'', a maximum likelihood estimator of ''θ'' is computed for each set of data by maximizing the likelihood function over the parameter space \Theta. When the observations are independent and identically distributed, an ML estimate satisfies
: \hat\theta = \arg\max_{\theta \in \Theta} \prod_{i=1}^n f(x_i, \theta)
or, equivalently,
: \hat\theta = \arg\max_{\theta \in \Theta} \sum_{i=1}^n \log f(x_i, \theta).
Maximum-likelihood estimators have optimal properties in the limit of infinitely many observations under rather general conditions, but may be biased and not the most efficient estimators for finite samples.
Definition
In 1964, Peter J. Huber proposed generalizing maximum likelihood estimation to the minimization of
: \sum_{i=1}^n \rho(x_i, \theta),
where ρ is a function with certain properties (see below). The solutions
: \hat\theta = \arg\min_{\theta} \sum_{i=1}^n \rho(x_i, \theta)
are called M-estimators ("M" for "maximum likelihood-type" (Huber, 1981, page 43)); other types of robust estimators include L-estimators, R-estimators and S-estimators. Maximum likelihood estimators (MLE) are thus a special case of M-estimators. With suitable rescaling, M-estimators are special cases of extremum estimators (in which more general functions of the observations can be used).
The function ρ, or its derivative, ψ, can be chosen in such a way as to give the estimator desirable properties (in terms of bias and efficiency) when the data are truly from the assumed distribution, and 'not bad' behaviour when the data are generated from a model that is, in some sense, ''close'' to the assumed distribution.
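A standard example of such a ρ is the Huber loss with tuning constant ''k'' > 0, which is quadratic for small arguments and linear in the tails, so that its derivative ψ clips large residuals:
: \rho_k(u) = \begin{cases} u^2/2, & |u| \le k \\ k|u| - k^2/2, & |u| > k \end{cases} \qquad \psi_k(u) = \max(-k, \min(k, u)).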
Types
M-estimators are solutions, ''θ'', which minimize
: \sum_{i=1}^n \rho(x_i, \theta).
This minimization can always be done directly. Often it is simpler to differentiate with respect to ''θ'' and solve for the root of the derivative. When this differentiation is possible, the M-estimator is said to be of ψ-type. Otherwise, the M-estimator is said to be of ρ-type.
In most practical cases, the M-estimators are of ψ-type.
ρ-type
For positive integer ''r'', let (\mathcal{X}, \Sigma) and (\Theta \subset \mathbb{R}^r, S) be measure spaces, and let \theta \in \Theta be a vector of parameters. An M-estimator of ρ-type ''T'' is defined through a measurable function \rho : \mathcal{X} \times \Theta \to \mathbb{R}. It maps a probability distribution ''F'' on \mathcal{X} to the value T(F) \in \Theta (if it exists) that minimizes
: \int_{\mathcal{X}} \rho(x, \theta) \, dF(x),
that is,
: T(F) := \arg\min_{\theta \in \Theta} \int_{\mathcal{X}} \rho(x, \theta) \, dF(x).
For example, for the maximum likelihood estimator, \rho(x, \theta) = -\log f(x, \theta), where f(x, \theta) = \partial F(x, \theta) / \partial x.
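As a concrete instance, for a unit-variance Gaussian location family f(x, \theta) = (2\pi)^{-1/2} e^{-(x - \theta)^2/2}, this prescription gives
: \rho(x, \theta) = -\log f(x, \theta) = \tfrac{1}{2}(x - \theta)^2 + \tfrac{1}{2}\log(2\pi),
so maximum likelihood coincides with least squares up to an additive constant.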
ψ-type
If ρ is differentiable with respect to \theta, the computation of \hat\theta is usually much easier. An M-estimator of ψ-type ''T'' is defined through a measurable function \psi : \mathcal{X} \times \Theta \to \mathbb{R}^r. It maps a probability distribution ''F'' on \mathcal{X} to the value T(F) \in \Theta (if it exists) that solves the vector equation
: \int_{\mathcal{X}} \psi(x, \theta) \, dF(x) = 0,
that is,
: \int_{\mathcal{X}} \psi(x, T(F)) \, dF(x) = 0.
For example, for the maximum likelihood estimator,
: \psi(x, \theta) = \left( \frac{\partial \log f(x, \theta)}{\partial \theta^1}, \dots, \frac{\partial \log f(x, \theta)}{\partial \theta^r} \right)^{\mathsf{T}},
where u^{\mathsf{T}} denotes the transpose of vector ''u'' and f(x, \theta) = \partial F(x, \theta) / \partial x.
Such an estimator is not necessarily an M-estimator of ρ-type, but if ρ has a continuous first derivative with respect to \theta, then a necessary condition for an M-estimator of ψ-type to be an M-estimator of ρ-type is \psi(x, \theta) = \nabla_\theta \rho(x, \theta). The previous definitions can easily be extended to finite samples.
If the function ψ decreases to zero as x \to \pm\infty, the estimator is called redescending. Such estimators have some additional desirable properties, such as complete rejection of gross outliers.
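A well-known redescending ψ is Tukey's biweight (bisquare) function with tuning constant ''c'' > 0, which gives observations beyond ''c'' exactly zero influence:
: \psi_c(u) = \begin{cases} u \left(1 - (u/c)^2\right)^2, & |u| \le c \\ 0, & |u| > c. \end{cases}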
Computation
For many choices of ρ or ψ, no closed form solution exists and an iterative approach to computation is required. It is possible to use standard function optimization algorithms, such as Newton–Raphson. However, in most cases an iteratively re-weighted least squares fitting algorithm can be performed; this is typically the preferred method.
For some choices of ψ, specifically, ''redescending'' functions, the solution may not be unique. The issue is particularly relevant in multivariate and regression problems. Thus, some care is needed to ensure that good starting points are chosen. Robust starting points, such as the median as an estimate of location and the median absolute deviation as a univariate estimate of scale, are common; a computational sketch using both follows.
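The following is a minimal sketch, in Python with NumPy, of iteratively re-weighted least squares for the Huber estimate of location, started from the median and the median absolute deviation as suggested above. The function name huber_location, the conventional tuning constant k = 1.345 and the normal-consistency factor 0.6745 are illustrative choices rather than anything fixed by the theory.

    import numpy as np

    def huber_location(x, k=1.345, tol=1e-8, max_iter=100):
        """Huber M-estimate of location via IRLS (illustrative sketch)."""
        x = np.asarray(x, dtype=float)
        theta = np.median(x)  # robust starting point for location
        # MAD rescaled to be consistent for the standard deviation at the normal
        scale = np.median(np.abs(x - theta)) / 0.6745
        if scale == 0:
            return theta  # more than half the data are identical
        for _ in range(max_iter):
            r = (x - theta) / scale  # standardized residuals
            # Huber weights w(r) = psi(r)/r: 1 inside [-k, k], k/|r| outside
            w = np.minimum(1.0, k / np.maximum(np.abs(r), 1e-12))
            theta_new = np.sum(w * x) / np.sum(w)  # weighted least-squares update
            if abs(theta_new - theta) < tol * scale:
                break
            theta = theta_new
        return theta_new

    # Example: a contaminated sample; the estimate stays near 0 despite the outlier
    rng = np.random.default_rng(0)
    sample = np.concatenate([rng.normal(0.0, 1.0, 99), [50.0]])
    print(huber_location(sample))

Each iteration solves a weighted least-squares problem whose weights ψ(r)/r down-weight large standardized residuals; for a redescending ψ the same loop applies, but the robust start matters more because the estimating equation may have multiple roots.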
Concentrating parameters
In computation of M-estimators, it is sometimes useful to rewrite the objective function so that the dimension of parameters is reduced. The procedure is called “concentrating” or “profiling”. Examples in which concentrating parameters increases computation speed include seemingly unrelated regressions (SUR) models.
Consider the following M-estimation problem:
: (\hat\beta_n, \hat\gamma_n) := \arg\max_{\beta, \gamma} \sum_{i=1}^n q(w_i, \beta, \gamma).
Assuming differentiability of the function ''q'', the M-estimator solves the first-order conditions:
: \sum_{i=1}^n \nabla_\beta \, q(w_i, \beta, \gamma) = 0,
: \sum_{i=1}^n \nabla_\gamma \, q(w_i, \beta, \gamma) = 0.
Now, if we can solve the second equation for γ in terms of β and W := (w_1, \dots, w_n), the second equation becomes:
: \sum_{i=1}^n \nabla_\gamma \, q(w_i, \beta, g(W, \beta)) = 0,
where ''g'' is some function to be found. Now, we can rewrite the original objective function solely in terms of β by inserting the function ''g'' into the place of γ. As a result, there is a reduction in the number of parameters.
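Explicitly, the concentrated (profiled) problem is then
: \hat\beta_n = \arg\max_{\beta} \sum_{i=1}^n q(w_i, \beta, g(W, \beta)),
a maximization over β alone.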
Whether this procedure can be done depends on the particular problem at hand. However, when it is possible, concentrating parameters can facilitate computation to a great degree. For example, in estimating a SUR model of 6 equations with 5 explanatory variables in each equation by maximum likelihood, the number of parameters declines from 51 (30 regression coefficients plus the 21 distinct elements of the 6 × 6 error covariance matrix, which can be concentrated out) to 30.
Despite its appealing feature in computation, concentrating parameters is of limited use in deriving asymptotic properties of M-estimators. The presence of W in each summand of the objective function makes it difficult to apply the law of large numbers and the central limit theorem.
Properties
Distribution
It can be shown that M-estimators are asymptotically normally distributed. As such, Wald-type approaches to constructing confidence intervals and hypothesis tests can be used. However, since the theory is asymptotic, it will frequently be sensible to check the distribution, perhaps by examining the permutation or bootstrap distribution.
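For ψ-type estimators the limiting covariance has the familiar sandwich form: under regularity conditions,
: \sqrt{n} \, (\hat\theta_n - \theta_0) \xrightarrow{d} \mathcal{N}\left(0, \, M^{-1} Q (M^{-1})^{\mathsf{T}}\right), \qquad M = \operatorname{E}\left[\frac{\partial \psi(X, \theta_0)}{\partial \theta}\right], \quad Q = \operatorname{E}\left[\psi(X, \theta_0) \, \psi(X, \theta_0)^{\mathsf{T}}\right],
which is what the Wald-type intervals above are built on.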
Influence function
The influence function of an M-estimator of ψ-type is proportional to its defining ψ function.
Let ''T'' be an M-estimator of ψ-type, and ''G'' be a probability distribution for which T(G) is defined. Its influence function IF is
: \operatorname{IF}(x; T, G) = - \frac{\psi(x, T(G))}{\int \left[ \frac{\partial \psi(y, \theta)}{\partial \theta} \right]_{\theta = T(G)} f(y) \, dy},
assuming the density function f(y) exists. A proof of this property of M-estimators can be found in Huber (1981, Section 3.2).
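For instance, for the mean, ψ(''x'', ''θ'') = ''θ'' − ''x'' yields \operatorname{IF}(x) = x - T(G), which is unbounded, whereas for the median the influence function is proportional to \operatorname{sgn}(x - T(G)) and hence bounded; this contrast is one way to see why the median resists gross outliers while the mean does not.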
Applications
M-estimators can be constructed for location parameters and scale parameters in univariate and multivariate settings, as well as being used in robust regression.
Examples
Mean
Let (X_1, \dots, X_n) be a set of independent, identically distributed random variables, with distribution ''F''.
If we define
: \rho(x, \theta) = \frac{(x - \theta)^2}{2},
we note that this is minimized when ''θ'' is the mean of the ''X''s. Thus the mean is an M-estimator of ρ-type, with this ρ function.
As this ρ function is continuously differentiable in ''θ'', the mean is thus also an M-estimator of ψ-type for ψ(''x'', ''θ'') = ''θ'' − ''x''.
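Setting the corresponding sample estimating equation to zero recovers the sample mean:
: \sum_{i=1}^n (\hat\theta - x_i) = 0 \quad \Longrightarrow \quad \hat\theta = \frac{1}{n} \sum_{i=1}^n x_i.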
Median
For the median estimation of (X_1, \dots, X_n), instead we can define the ρ function as
: \rho(x, \theta) = |x - \theta|,
and similarly, the ρ function is minimized when ''θ'' is the median of the ''X''s.
While this ρ function is not differentiable in ''θ'', the ψ-type M-estimator, which is the subgradient of the ρ function, can be expressed as
: \psi(x, \theta) = \operatorname{sgn}(x - \theta) \quad \text{for } x \ne \theta
and
: \psi(x, \theta) \in [-1, 1] \quad \text{for } x = \theta.
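The sample estimating equation \sum_{i=1}^n \psi(x_i, \hat\theta) = 0 then requires the numbers of observations above and below \hat\theta to balance, which is precisely the defining property of the sample median.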
Sufficient conditions for statistical consistency
M-estimators are consistent under various sets of conditions. A typical set of assumptions is that the class of functions satisfies a uniform law of large numbers and that the maximum is well-separated. Specifically, given an empirical and a population objective M_n and M, respectively, suppose that as n \to \infty
: \sup_{\theta \in \Theta} \| M_n(\theta) - M(\theta) \| \xrightarrow{P} 0
and that for every \varepsilon > 0
: \sup_{\theta : d(\theta, \theta^*) \ge \varepsilon} M(\theta) < M(\theta^*),
where ''d'' is a distance function and \theta^* is the optimum; then M-estimation is consistent.
The uniform convergence constraint is not necessarily required; an alternate set of assumptions is to instead consider pointwise convergence (in probability) of the objective functions. Additionally, assume that each of the objective functions has a continuous derivative with exactly one zero, or has a derivative which is non-decreasing and is o_P(1) at the estimator. Finally, assume that the maximum \theta^* is well-separated. Then M-estimation is consistent (van der Vaart, 1998).
See also
* Two-step M-estimator
* Robust statistics
* Robust regression
* Redescending M-estimator
* S-estimator
* Fréchet mean
References
* Huber, Peter J. (1981). ''Robust Statistics''. New York: John Wiley & Sons.
* van der Vaart, A. W. (1998). ''Asymptotic Statistics''. Cambridge University Press.
External links
* M-estimators, an introduction to the subject by Zhengyou Zhang