In statistics, the bias of an estimator (or bias function) is the difference between this estimator's expected value and the true value of the parameter being estimated. An estimator or decision rule with zero bias is called ''unbiased''. In statistics, "bias" is an objective property of an estimator. Bias is a distinct concept from consistency: consistent estimators converge in probability to the true value of the parameter, but may be biased or unbiased; see bias versus consistency for more.

All else being equal, an unbiased estimator is preferable to a biased estimator, although in practice biased estimators (with generally small bias) are frequently used. When a biased estimator is used, bounds on the bias are calculated. A biased estimator may be used for various reasons: because an unbiased estimator does not exist without further assumptions about a population; because an estimator is difficult to compute (as in unbiased estimation of standard deviation); because a biased estimator may be unbiased with respect to different measures of central tendency; because a biased estimator gives a lower value of some loss function (particularly mean squared error) compared with unbiased estimators (notably in shrinkage estimators); or because in some cases being unbiased is too strong a condition, and the only unbiased estimators are not useful.

Bias can also be measured with respect to the median, rather than the mean (expected value), in which case one distinguishes ''median''-unbiasedness from the usual ''mean''-unbiasedness property. Mean-unbiasedness is not preserved under non-linear transformations, though median-unbiasedness is (see § Effect of transformations below); for example, the sample variance is a biased estimator for the population variance. These are all illustrated below.


Definition

Suppose we have a statistical model, parameterized by a real number ''θ'', giving rise to a probability distribution for observed data, P_\theta(x) = P(x\mid\theta), and a statistic \hat\theta which serves as an estimator of ''θ'' based on any observed data x. That is, we assume that our data follow some unknown distribution P(x\mid\theta) (where ''θ'' is a fixed, unknown constant that is part of this distribution), and then we construct some estimator \hat\theta that maps observed data to values that we hope are close to ''θ''. The bias of \hat\theta relative to \theta is defined as

: \operatorname{Bias}(\hat\theta, \theta) = \operatorname{Bias}_\theta[\,\hat\theta\,] = \operatorname{E}_{x\mid\theta}[\,\hat\theta\,] - \theta = \operatorname{E}_{x\mid\theta}[\,\hat\theta - \theta\,],

where \operatorname{E}_{x\mid\theta} denotes expected value over the distribution P(x\mid\theta) (i.e., averaging over all possible observations x). The second equation follows since ''θ'' is measurable with respect to the conditional distribution P(x\mid\theta).

An estimator is said to be unbiased if its bias is equal to zero for all values of parameter ''θ'', or equivalently, if the expected value of the estimator matches that of the parameter. In a simulation experiment concerning the properties of an estimator, the bias of the estimator may be assessed using the mean signed difference.
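The mean-signed-difference assessment mentioned above can be sketched in a few lines of code. The following Python snippet (an illustration added here, not part of the original article; the estimator and distribution are arbitrary choices) approximates the bias of an estimator by averaging the signed difference between its estimates and the true parameter over many simulated samples.

```python
# Monte Carlo assessment of estimator bias via the mean signed difference.
# A minimal sketch: the estimator and the sampling distribution below are
# illustrative choices, not prescribed by the article.
import numpy as np

rng = np.random.default_rng(0)

def simulated_bias(estimator, sampler, true_theta, n_reps=100_000):
    """Average signed difference between the estimates and the true parameter."""
    estimates = np.array([estimator(sampler()) for _ in range(n_reps)])
    return estimates.mean() - true_theta

# Example: the uncorrected sample variance (divide by n) of normal data.
n, mu, sigma = 10, 0.0, 2.0
naive_var = lambda x: np.var(x, ddof=0)          # divides by n
sampler = lambda: rng.normal(mu, sigma, size=n)

print(simulated_bias(naive_var, sampler, sigma**2))   # ≈ -sigma**2 / n = -0.4
```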


Examples


Sample variance

The sample variance of a random variable demonstrates two aspects of estimator bias: firstly, the naive estimator is biased, which can be corrected by a scale factor; secondly, the unbiased estimator is not optimal in terms of mean squared error (MSE), which can be minimized by using a different scale factor, resulting in a biased estimator with lower MSE than the unbiased estimator. Concretely, the naive estimator sums the squared deviations and divides by ''n'', which is biased. Dividing instead by ''n'' − 1 yields an unbiased estimator. Conversely, MSE can be minimized by dividing by a different number (depending on distribution), but this results in a biased estimator. This number is always larger than ''n'' − 1, so this is known as a shrinkage estimator, as it "shrinks" the unbiased estimator towards zero; for the normal distribution the optimal value is ''n'' + 1.

Suppose ''X''1, ..., ''X''''n'' are
independent and identically distributed (i.i.d.) random variables with expectation ''μ'' and variance ''σ''2. If the sample mean and uncorrected sample variance are defined as

: \overline{X}\,=\frac{1}{n}\sum_{i=1}^n X_i \qquad S^2=\frac{1}{n}\sum_{i=1}^n\big(X_i-\overline{X}\,\big)^2,

then ''S''2 is a biased estimator of ''σ''2, because

: \begin{align}
\operatorname{E}[S^2]
 &= \operatorname{E}\left[ \frac{1}{n}\sum_{i=1}^n \big(X_i-\overline{X}\big)^2 \right]
  = \operatorname{E}\bigg[ \frac{1}{n}\sum_{i=1}^n \big((X_i-\mu)-(\overline{X}-\mu)\big)^2 \bigg] \\[8pt]
 &= \operatorname{E}\bigg[ \frac{1}{n}\sum_{i=1}^n \Big((X_i-\mu)^2 - 2(\overline{X}-\mu)(X_i-\mu) + (\overline{X}-\mu)^2\Big) \bigg] \\[8pt]
 &= \operatorname{E}\bigg[ \frac{1}{n}\sum_{i=1}^n (X_i-\mu)^2 - \frac{2}{n}(\overline{X}-\mu)\sum_{i=1}^n (X_i-\mu) + \frac{1}{n}(\overline{X}-\mu)^2\sum_{i=1}^n 1 \bigg] \\[8pt]
 &= \operatorname{E}\bigg[ \frac{1}{n}\sum_{i=1}^n (X_i-\mu)^2 - \frac{2}{n}(\overline{X}-\mu)\sum_{i=1}^n (X_i-\mu) + (\overline{X}-\mu)^2 \bigg].
\end{align}

To continue, we note that by subtracting \mu from both sides of \overline{X}= \frac{1}{n}\sum_{i=1}^n X_i, we get

: \overline{X}-\mu = \frac{1}{n}\sum_{i=1}^n X_i - \mu = \frac{1}{n}\sum_{i=1}^n X_i - \frac{1}{n}\sum_{i=1}^n\mu = \frac{1}{n}\sum_{i=1}^n (X_i - \mu).

Meaning, (by cross-multiplication) n \cdot (\overline{X}-\mu)=\sum_{i=1}^n (X_i-\mu). Then, the previous becomes:

: \begin{align}
\operatorname{E}[S^2]
 &= \operatorname{E}\bigg[ \frac{1}{n}\sum_{i=1}^n (X_i-\mu)^2 - \frac{2}{n}(\overline{X}-\mu)\sum_{i=1}^n (X_i-\mu) + (\overline{X}-\mu)^2 \bigg] \\[8pt]
 &= \operatorname{E}\bigg[ \frac{1}{n}\sum_{i=1}^n (X_i-\mu)^2 - \frac{2}{n}(\overline{X}-\mu) \cdot n \cdot (\overline{X}-\mu) + (\overline{X}-\mu)^2 \bigg] \\[8pt]
 &= \operatorname{E}\bigg[ \frac{1}{n}\sum_{i=1}^n (X_i-\mu)^2 - 2(\overline{X}-\mu)^2 + (\overline{X}-\mu)^2 \bigg] \\[8pt]
 &= \operatorname{E}\bigg[ \frac{1}{n}\sum_{i=1}^n (X_i-\mu)^2 - (\overline{X}-\mu)^2 \bigg] \\[8pt]
 &= \operatorname{E}\bigg[ \frac{1}{n}\sum_{i=1}^n (X_i-\mu)^2\bigg] - \operatorname{E}\big[(\overline{X}-\mu)^2 \big] \\[8pt]
 &= \sigma^2 - \operatorname{E}\big[(\overline{X}-\mu)^2 \big] = \left( 1 -\frac{1}{n}\right) \sigma^2 < \sigma^2.
\end{align}

This can be seen by noting the following formula, which follows from the Bienaymé formula, for the term in the inequality for the expectation of the uncorrected sample variance above: \operatorname{E}\big[(\overline{X}-\mu)^2 \big] = \frac{1}{n}\sigma^2.

In other words, the expected value of the uncorrected sample variance does not equal the population variance ''σ''2, unless multiplied by a normalization factor. The sample mean, on the other hand, is an unbiased estimator of the population mean ''μ''. Note that the usual definition of sample variance is S^2=\frac{1}{n-1}\sum_{i=1}^n(X_i-\overline{X}\,)^2, and this is an unbiased estimator of the population variance. Algebraically speaking, \operatorname{E}[S^2] is unbiased because:

: \begin{align}
\operatorname{E}[S^2]
 &= \operatorname{E}\left[ \frac{1}{n-1}\sum_{i=1}^n \big(X_i-\overline{X}\big)^2 \right]
  = \frac{n}{n-1}\operatorname{E}\left[ \frac{1}{n}\sum_{i=1}^n \big(X_i-\overline{X}\big)^2 \right] \\[8pt]
 &= \frac{n}{n-1}\left( 1 -\frac{1}{n}\right) \sigma^2 = \sigma^2,
\end{align}

where the transition to the second line uses the result derived above for the biased estimator. Thus \operatorname{E}[S^2]= \sigma^2, and therefore S^2=\frac{1}{n-1}\sum_{i=1}^n(X_i-\overline{X}\,)^2 is an unbiased estimator of the population variance, ''σ''2. The ratio between the biased (uncorrected) and unbiased estimates of the variance is known as Bessel's correction.
The reason that an uncorrected sample variance, ''S''2, is biased stems from the fact that the sample mean is an ordinary least squares (OLS) estimator for ''μ'': \overline{X} is the number that makes the sum \sum_{i=1}^n (X_i-\overline{X})^2 as small as possible. That is, when any other number is plugged into this sum, the sum can only increase. In particular, the choice \mu \ne \overline{X} gives

: \frac{1}{n}\sum_{i=1}^n (X_i-\overline{X})^2 < \frac{1}{n}\sum_{i=1}^n (X_i-\mu)^2,

and then

: \operatorname{E}[S^2] = \operatorname{E}\bigg[ \frac{1}{n}\sum_{i=1}^n (X_i-\overline{X})^2 \bigg] < \operatorname{E}\bigg[ \frac{1}{n}\sum_{i=1}^n (X_i-\mu)^2 \bigg] = \sigma^2.

The above discussion can be understood in geometric terms: the vector \vec{C}=(X_1 -\mu, \ldots, X_n-\mu) can be decomposed into the "mean part" and "variance part" by projecting onto the direction of \vec{u}=(1,\ldots, 1) and onto that direction's orthogonal complement hyperplane. One gets \vec{A}=(\overline{X}-\mu, \ldots, \overline{X}-\mu) for the part along \vec{u} and \vec{B}=(X_1-\overline{X}, \ldots, X_n-\overline{X}) for the complementary part. Since this is an orthogonal decomposition, the Pythagorean theorem says |\vec{C}|^2= |\vec{A}|^2+ |\vec{B}|^2, and taking expectations we get n \sigma^2 = n \operatorname{E}\left[(\overline{X}-\mu)^2 \right]+n \operatorname{E}[S^2], as above (but times n). If the distribution of \vec{C} is rotationally symmetric, as in the case when the X_i are sampled from a Gaussian, then on average the dimension along \vec{u} contributes to |\vec{C}|^2 equally as the n-1 directions perpendicular to \vec{u}, so that \operatorname{E}\left[(\overline{X}-\mu)^2 \right]=\frac{\sigma^2}{n} and \operatorname{E}[S^2]=\frac{(n-1)\sigma^2}{n}. This is in fact true in general, as explained above.
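A quick simulation (an illustrative sketch added here, not from the original article) makes the factor (''n'' − 1)/''n'' visible: averaging the uncorrected estimator over many samples falls short of ''σ''2, while the Bessel-corrected estimator does not.

```python
# A minimal simulation contrasting the uncorrected sample variance (divide by n)
# with the Bessel-corrected one (divide by n - 1) on normally distributed data.
# Sample size and variance are arbitrary illustrative choices.
import numpy as np

rng = np.random.default_rng(1)
n, sigma2, reps = 5, 4.0, 200_000

samples = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
biased = samples.var(axis=1, ddof=0)      # divides by n
unbiased = samples.var(axis=1, ddof=1)    # divides by n - 1

print(biased.mean())     # ≈ (1 - 1/n) * sigma2 = 3.2
print(unbiased.mean())   # ≈ sigma2 = 4.0
```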


Estimating a Poisson probability

A far more extreme case of a biased estimator being better than any unbiased estimator arises from the Poisson distribution. Suppose that ''X'' has a Poisson distribution with expectation ''λ''. Suppose it is desired to estimate

: \operatorname{P}(X=0)^2=e^{-2\lambda}

with a sample of size 1. (For example, when incoming calls at a telephone switchboard are modeled as a Poisson process, and ''λ'' is the average number of calls per minute, then ''e''−2''λ'' is the probability that no calls arrive in the next two minutes.)

Since the expectation of an unbiased estimator ''δ''(''X'') is equal to the estimand, i.e.

: \operatorname{E}(\delta(X))=\sum_{x=0}^\infty \delta(x) \frac{\lambda^x e^{-\lambda}}{x!} = e^{-2\lambda},

the only function of the data constituting an unbiased estimator is

: \delta(x)=(-1)^x. \,

To see this, note that when decomposing e−''λ'' from the above expression for expectation, the sum that is left is a Taylor series expansion of e−''λ'' as well, yielding e−''λ''e−''λ'' = e−2''λ'' (see Characterizations of the exponential function).

If the observed value of ''X'' is 100, then the estimate is 1, although the true value of the quantity being estimated is very likely to be near 0, which is the opposite extreme. And, if ''X'' is observed to be 101, then the estimate is even more absurd: it is −1, although the quantity being estimated must be positive. The (biased) maximum likelihood estimator

: e^{-2X}

is far better than this unbiased estimator. Not only is its value always positive, but it is also more accurate in the sense that its mean squared error

: e^{-4\lambda}-2e^{\lambda(1/e^2-3)}+e^{\lambda(1/e^4-1)} \,

is smaller; compare the unbiased estimator's MSE of

: 1-e^{-4\lambda}. \,

The MSEs are functions of the true value ''λ''. The bias of the maximum-likelihood estimator is:

: e^{\lambda(1/e^2-1)}-e^{-2\lambda}. \,
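The comparison can be checked numerically. The sketch below (an added illustration; the parameter value is an arbitrary choice) simulates Poisson draws and contrasts the unbiased estimator (−1)''X'' with the maximum-likelihood estimator ''e''−2''X''.

```python
# A minimal sketch: for X ~ Poisson(lam), compare the only unbiased estimator of
# exp(-2*lam), delta(X) = (-1)**X, with the biased maximum-likelihood estimator
# exp(-2*X). The value lam = 1.5 is an arbitrary illustrative choice.
import numpy as np

rng = np.random.default_rng(2)
lam, reps = 1.5, 500_000
target = np.exp(-2 * lam)                 # ≈ 0.0498

x = rng.poisson(lam, size=reps)
unbiased = (-1.0) ** x                    # unbiased, but only ever equals +1 or -1
mle = np.exp(-2.0 * x)                    # biased, but always in (0, 1]

print(unbiased.mean(), mle.mean(), target)   # unbiased mean ≈ target; MLE mean is biased upward
print(((unbiased - target) ** 2).mean())     # MSE ≈ 1 - exp(-4*lam) ≈ 1.0
print(((mle - target) ** 2).mean())          # much smaller MSE (≈ 0.2)
```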


Maximum of a discrete uniform distribution

The bias of maximum-likelihood estimators can be substantial. Consider a case where ''n'' tickets numbered from 1 through to ''n'' are placed in a box and one is selected at random, giving a value ''X''. If ''n'' is unknown, then the maximum-likelihood estimator of ''n'' is ''X'', even though the expectation of ''X'' given ''n'' is only (''n'' + 1)/2; we can be certain only that ''n'' is at least ''X'' and is probably more. In this case, the natural unbiased estimator is 2''X'' − 1.
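A short simulation (an added illustration, with an arbitrary choice of ''n'') shows the maximum-likelihood estimator ''X'' falling well short of ''n'' on average, while 2''X'' − 1 centres on ''n''.

```python
# A minimal sketch: draw one ticket X from {1, ..., n}; the MLE of n is X itself
# (biased low), while 2*X - 1 is unbiased.
import numpy as np

rng = np.random.default_rng(3)
n_true, reps = 50, 200_000

x = rng.integers(1, n_true + 1, size=reps)   # one observed ticket per replication
print(x.mean())              # ≈ (n_true + 1) / 2 = 25.5, so the MLE X is badly biased
print((2 * x - 1).mean())    # ≈ n_true = 50, the unbiased estimator
```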


Median-unbiased estimators

The theory of median-unbiased estimators was revived by George W. Brown in 1947: an estimate of a one-dimensional parameter ''θ'' is called median-unbiased if, for fixed ''θ'', the median of the distribution of the estimate is at the value ''θ''; that is, the estimate underestimates just as often as it overestimates. Further properties of median-unbiased estimators have been noted by Lehmann, Birnbaum, van der Vaart and Pfanzagl. In particular, median-unbiased estimators exist in cases where mean-unbiased and maximum-likelihood estimators do not exist. They are invariant under one-to-one transformations. There are methods of constructing median-unbiased estimators for probability distributions that have monotone likelihood functions, such as one-parameter exponential families, to ensure that they are optimal (in a sense analogous to the minimum-variance property considered for mean-unbiased estimators). One such procedure is an analogue of the Rao–Blackwell procedure for mean-unbiased estimators: the procedure holds for a smaller class of probability distributions than does the Rao–Blackwell procedure for mean-unbiased estimation but for a larger class of loss functions.
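As a concrete illustration (added here, not from the original article), the sample median of an odd-sized sample from a continuous distribution is a median-unbiased estimator of the population median, and, unlike mean-unbiasedness, this property survives a monotone transformation.

```python
# A minimal sketch: the sample median of an odd-sized sample from a continuous
# distribution is median-unbiased for the population median, and it remains
# median-unbiased under a monotone transformation (here, squaring positive
# values), even though it is not mean-unbiased for the transformed target.
import numpy as np

rng = np.random.default_rng(4)
n, reps = 9, 200_000
true_median = np.log(2.0)                 # median of the Exponential(1) distribution

est = np.median(rng.exponential(1.0, size=(reps, n)), axis=1)

print((est < true_median).mean())         # ≈ 0.5: median-unbiased
print((est**2 < true_median**2).mean())   # ≈ 0.5: still median-unbiased after the transform
print((est**2).mean(), true_median**2)    # means differ: the transformed estimator is not mean-unbiased
```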


Bias with respect to other loss functions

Any minimum-variance ''mean''-unbiased estimator minimizes the risk (expected loss) with respect to the squared-error loss function (among mean-unbiased estimators), as observed by Gauss. A minimum-average-absolute-deviation ''median''-unbiased estimator minimizes the risk with respect to the absolute loss function (among median-unbiased estimators), as observed by Laplace. Other loss functions are used in statistics, particularly in robust statistics.
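The pairing of loss functions with centres can be illustrated for a single fixed sample (an added sketch, loosely related to the statements above rather than taken from the article): the sample mean minimises the total squared deviation, while the sample median minimises the total absolute deviation.

```python
# A minimal sketch of the pairing of squared-error loss with the mean and
# absolute loss with the median: for a fixed sample, scan candidate centres c
# and locate the minimiser of each total loss.
import numpy as np

rng = np.random.default_rng(5)
x = rng.exponential(1.0, size=501)               # a skewed sample, so mean != median

cands = np.linspace(x.min(), x.max(), 2001)      # candidate centres c
sq_loss = ((x[:, None] - cands[None, :]) ** 2).sum(axis=0)
abs_loss = np.abs(x[:, None] - cands[None, :]).sum(axis=0)

print(cands[sq_loss.argmin()], x.mean())         # squared loss is minimised near the sample mean
print(cands[abs_loss.argmin()], np.median(x))    # absolute loss is minimised near the sample median
```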


Effect of transformations

For univariate parameters, median-unbiased estimators remain median-unbiased under transformations that preserve order (or reverse order). Note that, when a transformation is applied to a mean-unbiased estimator, the result need not be a mean-unbiased estimator of its corresponding population statistic. By Jensen's inequality, a convex function as transformation will introduce positive bias, while a concave function will introduce negative bias, and a function of mixed convexity may introduce bias in either direction, depending on the specific function and distribution. That is, for a non-linear function ''f'' and a mean-unbiased estimator ''U'' of a parameter ''p'', the composite estimator ''f''(''U'') need not be a mean-unbiased estimator of ''f''(''p''). For example, the square root of the unbiased estimator of the population variance is not a mean-unbiased estimator of the population standard deviation: the square root of the unbiased sample variance, the corrected sample standard deviation, is biased. The bias depends both on the sampling distribution of the estimator and on the transform, and can be quite involved to calculate – see unbiased estimation of standard deviation for a discussion in this case.
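A simulation (an added sketch with arbitrary parameter choices) shows the negative bias that the concave square-root transform introduces.

```python
# A minimal sketch: the square root of the unbiased sample variance (the
# corrected sample standard deviation) is a biased estimator of sigma,
# illustrating that mean-unbiasedness is not preserved under a concave transform.
import numpy as np

rng = np.random.default_rng(6)
n, sigma, reps = 5, 2.0, 500_000

s2 = rng.normal(0.0, sigma, size=(reps, n)).var(axis=1, ddof=1)   # unbiased for sigma**2
print(s2.mean())            # ≈ sigma**2 = 4.0
print(np.sqrt(s2).mean())   # < sigma = 2.0 (Jensen: concave transform gives negative bias)
```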


Bias, variance and mean squared error

While bias quantifies the ''average'' difference to be expected between an estimator and an underlying parameter, an estimator based on a finite sample can additionally be expected to differ from the parameter due to the randomness in the sample. An estimator that minimises the bias will not necessarily minimise the mean square error. One measure which is used to try to reflect both types of difference is the mean square error,

: \operatorname{MSE}(\hat\theta)=\operatorname{E}\big[(\hat\theta-\theta)^2\big].

This can be shown to be equal to the square of the bias, plus the variance:

: \begin{align}
\operatorname{MSE}(\hat\theta) &= (\operatorname{E}[\hat\theta]-\theta)^2 + \operatorname{E}\big[(\hat\theta - \operatorname{E}[\,\hat\theta\,])^2\big] \\
 &= (\operatorname{Bias}(\hat\theta,\theta))^2 + \operatorname{Var}(\hat\theta)
\end{align}

When the parameter is a vector, an analogous decomposition applies:

: \operatorname{MSE}(\hat\theta) = \operatorname{trace}(\operatorname{Cov}(\hat\theta)) + \left\Vert\operatorname{Bias}(\hat\theta,\theta)\right\Vert^2

where \operatorname{trace}(\operatorname{Cov}(\hat\theta)) is the trace (diagonal sum) of the covariance matrix of the estimator and \left\Vert\operatorname{Bias}(\hat\theta,\theta)\right\Vert^2 is the square vector norm.
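The decomposition can be verified numerically; in fact it holds exactly for the empirical moments. The sketch below (an added illustration using the uncorrected sample variance as the estimator) computes both sides from the same simulated estimates.

```python
# A minimal sketch checking the decomposition MSE = bias**2 + variance by
# Monte Carlo, with the uncorrected sample variance as the estimator.
import numpy as np

rng = np.random.default_rng(7)
n, sigma2, reps = 10, 1.0, 500_000

est = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n)).var(axis=1, ddof=0)
mse = ((est - sigma2) ** 2).mean()
bias = est.mean() - sigma2
var = est.var()

print(mse, bias**2 + var)   # identical up to floating-point rounding
```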


Example: Estimation of population variance

For example, suppose an estimator of the form

: T^2 = c \sum_{i=1}^n\left(X_i-\overline{X}\,\right)^2 = c n S^2

is sought for the population variance as above, but this time to minimise the MSE:

: \begin{align}
\operatorname{MSE} &= \operatorname{E}\left[(T^2 - \sigma^2)^2\right] \\
 &= \left(\operatorname{E}\left[T^2 - \sigma^2\right]\right)^2 + \operatorname{Var}(T^2)
\end{align}

If the variables ''X''1 ... ''X''''n'' follow a normal distribution, then ''nS''2/''σ''2 has a chi-squared distribution with ''n'' − 1 degrees of freedom, giving:

: \operatorname{E}[n S^2]= (n-1)\sigma^2 \text{ and } \operatorname{Var}(nS^2)=2(n-1)\sigma^4,

and so

: \operatorname{MSE} = (c (n-1) - 1)^2\sigma^4 + 2c^2(n-1)\sigma^4.

With a little algebra it can be confirmed that it is ''c'' = 1/(''n'' + 1) which minimises this combined loss function, rather than ''c'' = 1/(''n'' − 1) which minimises just the square of the bias. More generally it is only in restricted classes of problems that there will be an estimator that minimises the MSE independently of the parameter values. However it is very common that there may be perceived to be a ''bias–variance tradeoff'', such that a small increase in bias can be traded for a larger decrease in variance, resulting in a more desirable estimator overall.
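A simulation (an added sketch, with ''n'' = 10 chosen arbitrarily) confirms that, for normal data, dividing the sum of squared deviations by ''n'' + 1 gives a lower MSE than dividing by ''n'' or by ''n'' − 1.

```python
# A minimal sketch: compare the MSE of sum-of-squared-deviations / divisor for
# divisors n - 1, n and n + 1 on normal data; n + 1 gives the smallest MSE
# even though only n - 1 gives an unbiased estimator.
import numpy as np

rng = np.random.default_rng(8)
n, sigma2, reps = 10, 1.0, 500_000

x = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
ss = ((x - x.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)   # sum of squared deviations

for divisor in (n - 1, n, n + 1):
    est = ss / divisor
    print(divisor, ((est - sigma2) ** 2).mean())   # smallest MSE at divisor n + 1
```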


Bayesian view

Most Bayesians are rather unconcerned about unbiasedness (at least in the formal sampling-theory sense above) of their estimates. For example, Gelman and coauthors (1995) write: "From a Bayesian perspective, the principle of unbiasedness is reasonable in the limit of large samples, but otherwise it is potentially misleading."

Fundamentally, the difference between the Bayesian approach and the sampling-theory approach above is that in the sampling-theory approach the parameter is taken as fixed, and then probability distributions of a statistic are considered, based on the predicted sampling distribution of the data. For a Bayesian, however, it is the ''data'' which are known and fixed, and it is the unknown parameter for which an attempt is made to construct a probability distribution, using Bayes' theorem:

: p(\theta \mid D, I) \propto p(\theta \mid I)\, p(D \mid \theta, I)

Here the second term, the likelihood of the data given the unknown parameter value ''θ'', depends just on the data obtained and the modelling of the data generation process. However, a Bayesian calculation also includes the first term, the prior probability for ''θ'', which takes account of everything the analyst may know or suspect about ''θ'' ''before'' the data comes in. This information plays no part in the sampling-theory approach; indeed any attempt to include it would be considered "bias" away from what was pointed to purely by the data. To the extent that Bayesian calculations include prior information, it is therefore essentially inevitable that their results will not be "unbiased" in sampling-theory terms.

But the results of a Bayesian approach can differ from the sampling-theory approach even if the Bayesian tries to adopt an "uninformative" prior. For example, consider again the estimation of an unknown population variance ''σ''2 of a normal distribution with unknown mean, where it is desired to optimise ''c'' in the expected loss function

: \operatorname{ExpectedLoss} = \operatorname{E}\left[\left(c n S^2 - \sigma^2\right)^2\right] = \operatorname{E}\left[\sigma^4 \left(c n \tfrac{S^2}{\sigma^2} -1 \right)^2\right]

A standard choice of uninformative prior for this problem is the Jeffreys prior, p(\sigma^2) \propto 1/\sigma^2, which is equivalent to adopting a rescaling-invariant flat prior for ln(''σ''2). One consequence of adopting this prior is that ''S''2/''σ''2 remains a pivotal quantity, i.e. the probability distribution of ''S''2/''σ''2 depends only on ''S''2/''σ''2, independent of the value of ''S''2 or ''σ''2:

: p\left(\tfrac{S^2}{\sigma^2}\mid S^2\right) = p\left(\tfrac{S^2}{\sigma^2}\mid \sigma^2\right) = g\left(\tfrac{S^2}{\sigma^2}\right)

However, while

: \operatorname{E}_{S^2\mid\sigma^2}\left[\sigma^4 \left(c n \tfrac{S^2}{\sigma^2} -1 \right)^2\right] = \sigma^4 \operatorname{E}_{S^2\mid\sigma^2}\left[\left(c n \tfrac{S^2}{\sigma^2} -1 \right)^2\right],

in contrast

: \operatorname{E}_{\sigma^2\mid S^2}\left[\sigma^4 \left(c n \tfrac{S^2}{\sigma^2} -1 \right)^2\right] \neq \sigma^4 \operatorname{E}_{\sigma^2\mid S^2}\left[\left(c n \tfrac{S^2}{\sigma^2} -1 \right)^2\right]

— when the expectation is taken over the probability distribution of ''σ''2 given ''S''2, as it is in the Bayesian case, rather than ''S''2 given ''σ''2, one can no longer take ''σ''4 as a constant and factor it out. The consequence of this is that, compared to the sampling-theory calculation, the Bayesian calculation puts more weight on larger values of ''σ''2, properly taking into account (as the sampling-theory calculation cannot) that under this squared-loss function the consequence of underestimating large values of ''σ''2 is more costly in squared-loss terms than that of overestimating small values of ''σ''2.

The worked-out Bayesian calculation gives a scaled inverse chi-squared distribution with ''n'' − 1 degrees of freedom for the posterior probability distribution of ''σ''2. The expected loss is minimised when ''cnS''2 = <''σ''2> (the posterior mean of ''σ''2); this occurs when ''c'' = 1/(''n'' − 3). Even with an uninformative prior, therefore, a Bayesian calculation may not give the same expected-loss minimising result as the corresponding sampling-theory calculation.


See also

* Consistent estimator
* Efficient estimator
* Estimation theory
* Expected loss
* Expected value
* Loss function
* Minimum-variance unbiased estimator
* Omitted-variable bias
* Optimism bias
* Ratio estimator
* Statistical decision theory




References


* Brown, George W. "On Small-Sample Estimation." ''The Annals of Mathematical Statistics'', vol. 18, no. 4 (Dec., 1947), pp. 582–585.
* Lehmann, E. L. "A General Concept of Unbiasedness." ''The Annals of Mathematical Statistics'', vol. 22, no. 4 (Dec., 1951), pp. 587–592.
* Birnbaum, Allan, 1961. "A Unified Theory of Estimation, I." ''The Annals of Mathematical Statistics'', vol. 32, no. 1 (Mar., 1961), pp. 112–135.
* Van der Vaart, H. R., 1961. "Some Extensions of the Idea of Bias." ''The Annals of Mathematical Statistics'', vol. 32, no. 2 (June 1961), pp. 436–447.
* Pfanzagl, Johann. 1994. ''Parametric Statistical Theory''. Walter de Gruyter.

