In statistics, pooled variance (also known as combined variance, composite variance, or overall variance, and written $s_p^2$) is a method for estimating the variance of several different populations when the mean of each population may be different, but one may assume that the variance of each population is the same. The numerical estimate resulting from the use of this method is also called the pooled variance.
Under the assumption of equal population variances, the pooled sample variance provides a higher-precision estimate of variance than the individual sample variances. This higher precision can lead to increased statistical power when used in statistical tests that compare the populations, such as the ''t''-test.
The square root of a pooled variance estimator is known as a pooled standard deviation (also known as combined standard deviation, composite standard deviation, or overall standard deviation).
Motivation
In statistics, data are often collected for a dependent variable, ''y'', over a range of values for the independent variable, ''x''. For example, fuel consumption might be studied as a function of engine speed while the engine load is held constant. If, in order to achieve a small variance in ''y'', numerous repeated tests are required at each value of ''x'', the expense of testing may become prohibitive. Reasonable estimates of variance can be determined by using the principle of pooled variance after repeating each test at a particular ''x'' only a few times.
Definition and computation
The pooled variance is an estimate of the fixed common variance $\sigma^2$ underlying various populations that have different means.
We are given a set of sample variances $s_i^2$, where the populations are indexed $i = 1, \ldots, m$,
: $s_i^2 = \frac{1}{n_i - 1} \sum_{j=1}^{n_i} \left(y_{ij} - \bar{y}_i\right)^2,$
where $y_{ij}$ is the $j$-th observation in sample $i$, $n_i$ is the size of that sample, and $\bar{y}_i$ is its mean.
Assuming uniform sample sizes, $n_i = n$, then the pooled variance $s_p^2$ can be computed by the arithmetic mean:
: $s_p^2 = \frac{\sum_{i=1}^m s_i^2}{m} = \frac{s_1^2 + s_2^2 + \cdots + s_m^2}{m}.$
If the sample sizes are non-uniform, then the pooled variance $s_p^2$ can be computed by the weighted average, using as weights $w_i = n_i - 1$ the respective degrees of freedom (see also: Bessel's correction):
: $s_p^2 = \frac{\sum_{i=1}^m (n_i - 1)s_i^2}{\sum_{i=1}^m (n_i - 1)} = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2 + \cdots + (n_m - 1)s_m^2}{n_1 + n_2 + \cdots + n_m - m}.$
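The weighted average above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `pooled_variance` and the summary statistics are hypothetical, and the inputs are assumed to be Bessel-corrected sample variances.

```python
from typing import Sequence


def pooled_variance(sizes: Sequence[int], variances: Sequence[float]) -> float:
    """Pool Bessel-corrected sample variances, weighting each by n_i - 1."""
    if len(sizes) != len(variances):
        raise ValueError("sizes and variances must have the same length")
    dof = [n - 1 for n in sizes]  # degrees of freedom of each sample
    if any(d < 1 for d in dof):
        raise ValueError("every sample needs at least two observations")
    return sum(d * s2 for d, s2 in zip(dof, variances)) / sum(dof)


# Hypothetical summary statistics for three samples of unequal size.
sizes = [3, 5, 4]
variances = [1.0, 4.0, 2.5]
print(pooled_variance(sizes, variances))  # (2*1.0 + 4*4.0 + 3*2.5) / 9
```

With equal sample sizes the weights cancel and the result reduces to the arithmetic mean of the sample variances.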
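The distribution of $(N - m)\, s_p^2 / \sigma^2$, where $N = \sum_{i=1}^m n_i$, is $\chi^2_{N-m}$, the chi-squared distribution with $N - m$ degrees of freedom, provided the observations are independent and normally distributed.
Proof. When there is a single mean, the vector of deviations $(y_1 - \bar{y}, \ldots, y_n - \bar{y})$ is Gaussian in $\{v \in \mathbb{R}^n : \sum_j v_j = 0\}$, an $(n-1)$-dimensional subspace, with standard deviation $\sigma$ in every direction, so its squared length divided by $\sigma^2$ is $\chi^2_{n-1}$. Where there are multiple means, the combined vector of deviations $(y_{ij} - \bar{y}_i)$ is Gaussian in the product of these subspaces, which has dimension $\sum_i (n_i - 1) = N - m$, and its squared length is $(N - m)\, s_p^2$.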
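One way to build intuition for this distributional claim is a small simulation. The sketch below is a hypothetical check, assuming independent normal observations with a common standard deviation; the group sizes, means, seed, and the helper name `scaled_pooled_variance` are all illustrative choices. It compares the simulated mean and variance of $(N - m)\, s_p^2 / \sigma^2$ with the values $N - m$ and $2(N - m)$ expected for a chi-squared variable.

```python
import random
import statistics


def scaled_pooled_variance(sizes, means, sigma, rng):
    """Draw one normal sample per group and return (N - m) * s_p^2 / sigma^2."""
    dof_total = sum(sizes) - len(sizes)
    weighted = 0.0
    for n, mu in zip(sizes, means):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        weighted += (n - 1) * statistics.variance(sample)
    s_p2 = weighted / dof_total
    return dof_total * s_p2 / sigma**2


rng = random.Random(0)  # fixed seed so the run is repeatable
sizes, means, sigma = [3, 4, 5], [10.0, -2.0, 5.0], 2.0
draws = [scaled_pooled_variance(sizes, means, sigma, rng) for _ in range(5000)]

dof = sum(sizes) - len(sizes)  # N - m = 9
print("simulated mean:", statistics.fmean(draws), "expected:", dof)
print("simulated variance:", statistics.variance(draws), "expected:", 2 * dof)
```

The group means differ widely on purpose: they drop out of the deviations, so only the common variance affects the result.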
Variants
The unbiased least squares estimate of $\sigma^2$ (as presented above), and the biased maximum likelihood estimate below:
: $s_p^2 = \frac{\sum_{i=1}^m (n_i - 1)s_i^2}{\sum_{i=1}^m n_i},$
are used in different contexts. The former gives an unbiased estimate $s_p^2$ of $\sigma^2$ when the groups share an equal population variance. The latter gives a more efficient estimate $s_p^2$ of $\sigma^2$, although it is subject to bias. Note that the quantities $s_i^2$ on the right-hand sides of both equations are the unbiased estimates.
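The two variants differ only in the denominator, which the following sketch makes explicit. The function name `pooled_variance_estimates` and the input values are hypothetical; the per-sample variances are assumed to be the unbiased (Bessel-corrected) ones, as the text requires.

```python
def pooled_variance_estimates(sizes, variances):
    """Return (unbiased, maximum-likelihood) pooled variance estimates."""
    numerator = sum((n - 1) * s2 for n, s2 in zip(sizes, variances))
    total = sum(sizes)                             # N = n_1 + ... + n_m
    unbiased = numerator / (total - len(sizes))    # divide by N - m
    max_likelihood = numerator / total             # divide by N
    return unbiased, max_likelihood


# Hypothetical summary statistics for three samples.
unbiased, ml = pooled_variance_estimates([3, 5, 4], [1.0, 4.0, 2.5])
print(unbiased, ml)
```

Because the numerator is shared, the maximum likelihood value always equals the unbiased value multiplied by $(N - m)/N$, so it is the smaller of the two.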
Example
Consider the following set of data for ''y'' obtained at various levels of the independent variable ''x''.
The number of trials, mean, variance and standard deviation are presented in the next table.
These statistics represent the variance and standard deviation for each subset of data at the various levels of ''x''. If we can assume that the same phenomena are generating random error at every level of ''x'', the above data can be “pooled” to express a single estimate of variance and standard deviation. In a sense, this suggests finding a mean variance or standard deviation among the five results above. This mean variance is calculated by weighting the individual values by the degrees of freedom of each subset, that is, the size of the subset at each level of ''x'' minus one. Thus, the pooled variance is defined by
: $s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2 + \cdots + (n_k - 1)s_k^2}{(n_1 - 1) + (n_2 - 1) + \cdots + (n_k - 1)}$
where $n_1, n_2, \ldots, n_k$ are the sizes of the data subsets at each level of the variable ''x'', and $s_1^2, s_2^2, \ldots, s_k^2$ are their respective variances.
The pooled variance of the data shown above is therefore:
:
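The same procedure can be sketched starting from raw measurements. The data below are hypothetical values chosen for easy arithmetic and are not the data set discussed in this section; the variable names are likewise illustrative.

```python
import statistics

# Hypothetical measurements of y at three levels of x.
data = {
    1: [10.0, 12.0, 14.0],
    2: [20.0, 21.0, 22.0, 23.0],
    3: [30.0, 34.0],
}

# Per-level summary: (x, number of trials, mean, sample variance).
rows = [
    (x, len(ys), statistics.fmean(ys), statistics.variance(ys))
    for x, ys in data.items()
]
for x, n, mean, s2 in rows:
    print(f"x={x}  n={n}  mean={mean:.2f}  variance={s2:.3f}  sd={s2 ** 0.5:.3f}")

# Pool the variances, weighting each level by its degrees of freedom.
dof = sum(n - 1 for _, n, _, _ in rows)
pooled_var = sum((n - 1) * s2 for _, n, _, s2 in rows) / dof
pooled_sd = pooled_var ** 0.5
print(f"pooled variance = {pooled_var:.3f}, pooled sd = {pooled_sd:.3f}")
```

For these inputs the weighted sums of squares are 8, 5, and 8 over 6 degrees of freedom, so the pooled variance is 3.5 up to floating-point rounding.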
Effect on precision
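Pooled variance is only an estimate of the overall variance when the pooled data sets are correlated or their means are not identical. The estimate becomes less precise the further the correlation is from zero or the further apart the means of the data sets are.
The variance of the combined data for non-overlapping data sets $X_i$, each with size $N_{X_i}$, mean $\mu_{X_i}$, and variance $\sigma_{X_i}^2$, is:
: $\sigma_X^2 = \frac{\sum_i \left[(N_{X_i} - 1)\sigma_{X_i}^2 + N_{X_i}\mu_{X_i}^2\right] - \left[\sum_i N_{X_i}\right]\mu_X^2}{\sum_i N_{X_i} - 1}$
where the mean is defined as:
: $\mu_X = \frac{\sum_i N_{X_i}\mu_{X_i}}{\sum_i N_{X_i}}$
Given a biased maximum likelihood estimate defined as:
: $s_p^2 = \frac{\sum_i (N_{X_i} - 1)\sigma_{X_i}^2}{\sum_i N_{X_i}}$
Then the error in the biased maximum likelihood estimate is:
: $E = s_p^2 - \sigma_X^2 = \frac{\sum_i (N_{X_i} - 1)\sigma_{X_i}^2}{\sum_i N_{X_i}} - \frac{\sum_i \left[(N_{X_i} - 1)\sigma_{X_i}^2 + N_{X_i}\mu_{X_i}^2\right] - \left[\sum_i N_{X_i}\right]\mu_X^2}{\sum_i N_{X_i} - 1}$
Assuming ''N'' is large such that:
: $\sum_i N_{X_i} \approx \sum_i N_{X_i} - 1$
Then the error in the estimate reduces to:
: $E = -\frac{\sum_i \left[N_{X_i}\mu_{X_i}^2\right] - \left[\sum_i N_{X_i}\right]\mu_X^2}{\sum_i N_{X_i}} = \mu_X^2 - \frac{\sum_i \left[N_{X_i}\mu_{X_i}^2\right]}{\sum_i N_{X_i}}$
Or alternatively:
: $E = \left[\frac{\sum_i N_{X_i}\mu_{X_i}}{\sum_i N_{X_i}}\right]^2 - \frac{\sum_i \left[N_{X_i}\mu_{X_i}^2\right]}{\sum_i N_{X_i}} = \frac{\left[\sum_i N_{X_i}\mu_{X_i}\right]^2 - \sum_i N_{X_i} \sum_i \left[N_{X_i}\mu_{X_i}^2\right]}{\left[\sum_i N_{X_i}\right]^2}$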
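The size of this error can be illustrated numerically. The sketch below is a hypothetical example: the function names and the two data-set summaries are assumptions made for illustration, and the per-set variances are taken to be Bessel-corrected. It compares the exact difference between the maximum likelihood pooled variance and the variance of the combined data with the large-''N'' approximation, which depends only on the sizes and means.

```python
def aggregate_variance(sizes, means, variances):
    """Bessel-corrected variance of the union of non-overlapping data sets."""
    total = sum(sizes)
    mean = sum(n * mu for n, mu in zip(sizes, means)) / total
    ss = sum((n - 1) * s2 + n * mu**2 for n, mu, s2 in zip(sizes, means, variances))
    return (ss - total * mean**2) / (total - 1)


def pooled_ml(sizes, variances):
    """Biased maximum likelihood pooled variance (denominator is N)."""
    return sum((n - 1) * s2 for n, s2 in zip(sizes, variances)) / sum(sizes)


def approx_error(sizes, means):
    """Large-N approximation of the error: mu_X^2 - sum(N_i mu_i^2) / sum(N_i)."""
    total = sum(sizes)
    mean = sum(n * mu for n, mu in zip(sizes, means)) / total
    return mean**2 - sum(n * mu**2 for n, mu in zip(sizes, means)) / total


# Hypothetical summaries of two large data sets whose means differ.
sizes, means, variances = [1000, 1500], [10.0, 12.0], [4.0, 4.0]
exact = pooled_ml(sizes, variances) - aggregate_variance(sizes, means, variances)
print("exact error:", exact)
print("approximate error:", approx_error(sizes, means))
```

The error is negative whenever the means differ, since pooling ignores the spread between group means, and the approximation gives zero when all means coincide.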
Aggregation of standard deviation data
When more statistical information is available, standard deviations can be aggregated exactly rather than estimated by pooling. The following formulas show how.
Population-based statistics
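The populations of sets, which may overlap, can be calculated simply as follows:
: $N_{X \cup Y} = N_X + N_Y - N_{X \cap Y}$
The populations of sets, which do not overlap, can be calculated simply as follows:
: $X \cap Y = \varnothing \Rightarrow N_{X \cap Y} = 0 \Rightarrow N_{X \cup Y} = N_X + N_Y$
Standard deviations of non-overlapping ($X \cap Y = \varnothing$) sub-populations can be aggregated as follows if the size (actual or relative to one another) and means of each are known:
: $\mu_{X \cup Y} = \frac{N_X \mu_X + N_Y \mu_Y}{N_X + N_Y}$
: $\sigma_{X \cup Y} = \sqrt{\frac{N_X \sigma_X^2 + N_Y \sigma_Y^2}{N_X + N_Y} + \frac{N_X N_Y}{(N_X + N_Y)^2}(\mu_X - \mu_Y)^2}$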
For example, suppose it is known that American men have a mean height of 70 inches with a standard deviation of three inches and that American women have a mean height of 65 inches with a standard deviation of two inches. Also assume that the number of men, ''N'', is equal to the number of women. Then the mean and standard deviation of heights of American adults could be calculated as
: $\mu = \frac{N \cdot 70 + N \cdot 65}{N + N} = \frac{70 + 65}{2} = 67.5$
: $\sigma = \sqrt{\frac{3^2 + 2^2}{2} + \frac{(70 - 65)^2}{2^2}} = \sqrt{12.75} \approx 3.57$
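The height example can be reproduced with a short sketch of the two-population aggregation. The function name `combine_two` is hypothetical, and the standard deviations are treated as population (not sample) values. Since only the relative sizes matter, each group is given size 1.

```python
import math


def combine_two(n_x, mu_x, sigma_x, n_y, mu_y, sigma_y):
    """Mean and population SD of the union of two non-overlapping populations."""
    n = n_x + n_y
    mu = (n_x * mu_x + n_y * mu_y) / n
    within = (n_x * sigma_x**2 + n_y * sigma_y**2) / n   # average variance
    between = n_x * n_y * (mu_x - mu_y) ** 2 / n**2      # spread of the means
    return mu, math.sqrt(within + between)


# Equal numbers of men (70 in, SD 3) and women (65 in, SD 2).
mu, sigma = combine_two(1, 70.0, 3.0, 1, 65.0, 2.0)
print(mu, round(sigma, 2))  # 67.5 3.57
```

The combined standard deviation exceeds both group values because the five-inch gap between the means adds to the spread.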
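For the more general case of ''M'' non-overlapping populations, $X_1$ through $X_M$, and the aggregate population $X = \bigcup_i X_i$,
: $\mu_X = \frac{\sum_i N_{X_i}\mu_{X_i}}{\sum_i N_{X_i}}$
: $\sigma_X = \sqrt{\frac{\sum_i N_{X_i}\sigma_{X_i}^2}{\sum_i N_{X_i}} + \frac{\sum_{i<j} N_{X_i}N_{X_j}\left(\mu_{X_i} - \mu_{X_j}\right)^2}{\left(\sum_i N_{X_i}\right)^2}}$,
where
: $X_i \cap X_j = \varnothing, \quad \forall\, i < j.$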
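The general case of several non-overlapping populations can be checked against a direct computation on the combined data. The sketch below is illustrative only: the function name `aggregate` and the three small groups are hypothetical, and population standard deviations (`statistics.pstdev`) are assumed throughout.

```python
import math
import statistics
from itertools import combinations


def aggregate(sizes, means, sds):
    """Mean and population SD of the union of non-overlapping populations."""
    total = sum(sizes)
    mu = sum(n * m for n, m in zip(sizes, means)) / total
    within = sum(n * s**2 for n, s in zip(sizes, sds)) / total
    # Pairwise term: sum over i < j of N_i * N_j * (mu_i - mu_j)^2, over N^2.
    between = sum(
        sizes[i] * sizes[j] * (means[i] - means[j]) ** 2
        for i, j in combinations(range(len(sizes)), 2)
    ) / total**2
    return mu, math.sqrt(within + between)


# Hypothetical non-overlapping groups.
groups = [[1.0, 2.0, 3.0], [10.0, 12.0], [5.0, 5.0, 6.0, 8.0]]
sizes = [len(g) for g in groups]
means = [statistics.fmean(g) for g in groups]
sds = [statistics.pstdev(g) for g in groups]

mu, sigma = aggregate(sizes, means, sds)
everything = [value for g in groups for value in g]
print(mu, statistics.fmean(everything))
print(sigma, statistics.pstdev(everything))
```

Both printed pairs should agree up to rounding, which shows that the sizes, means, and standard deviations of the parts are enough to recover the aggregate exactly without access to the raw data.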