Sample covariance
The sample mean (or "empirical mean") and the sample covariance are statistics computed from a sample of data on one or more random variables.

The sample mean is the average value (or mean value) of a sample of numbers taken from a larger population of numbers, where "population" indicates not number of people but the entirety of relevant data, whether collected or not. A sample of 40 companies' sales from the Fortune 500 might be used for convenience instead of looking at the population, all 500 companies' sales. The sample mean is used as an estimator for the population mean, the average value in the entire population, where the estimate is more likely to be close to the population mean if the sample is large and representative. The reliability of the sample mean is estimated using the standard error, which in turn is calculated using the variance of the sample. If the sample is random, the standard error falls with the size of the sample, and the sample mean's distribution approaches the normal distribution as the sample size increases.

The term "sample mean" can also be used to refer to a vector of average values when the statistician is looking at the values of several variables in the sample, e.g. the sales, profits, and employees of a sample of Fortune 500 companies. In this case, there is not just a sample variance for each variable but a sample variance-covariance matrix (or simply covariance matrix) showing also the relationship between each pair of variables. This would be a 3×3 matrix when 3 variables are being considered. The sample covariance is useful in judging the reliability of the sample means as estimators and is also useful as an estimate of the population covariance matrix.

Due to their ease of calculation and other desirable characteristics, the sample mean and sample covariance are widely used in statistics to represent the location and dispersion of the distribution of values in the sample, and to estimate the values for the population.


Definition of the sample mean

The sample mean is the average of the values of a variable in a sample, which is the sum of those values divided by the number of values. Using mathematical notation, if a sample of ''N'' observations on variable ''X'' is taken from the population, the sample mean is:

: \bar{X} = \frac{1}{N} \sum_{i=1}^{N} X_i.

Under this definition, if the sample (1, 4, 1) is taken from the population (1, 1, 3, 4, 0, 2, 1, 0), then the sample mean is \bar{x} = (1+4+1)/3 = 2, as compared to the population mean of \mu = (1+1+3+4+0+2+1+0)/8 = 12/8 = 1.5. Even if a sample is random, it is rarely perfectly representative, and other samples would have other sample means even if the samples were all from the same population. The sample (2, 1, 0), for example, would have a sample mean of 1.

If the statistician is interested in ''K'' variables rather than one, each observation having a value for each of those ''K'' variables, the overall sample mean consists of ''K'' sample means for individual variables. Let x_{ij} be the ''i''th independently drawn observation (''i'' = 1, ..., ''N'') on the ''j''th random variable (''j'' = 1, ..., ''K''). These observations can be arranged into ''N'' column vectors, each with ''K'' entries, with the ''K''×1 column vector giving the ''i''th observations of all variables being denoted \mathbf{x}_i (''i'' = 1, ..., ''N''). The sample mean vector \mathbf{\bar{x}} is a column vector whose ''j''th element \bar{x}_j is the average value of the ''N'' observations of the ''j''th variable:

: \bar{x}_j = \frac{1}{N} \sum_{i=1}^{N} x_{ij}, \quad j = 1, \ldots, K.

Thus, the sample mean vector contains the average of the observations for each variable, and is written

: \mathbf{\bar{x}} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{x}_i = \begin{bmatrix} \bar{x}_1 \\ \vdots \\ \bar{x}_j \\ \vdots \\ \bar{x}_K \end{bmatrix}.
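A minimal numerical sketch of the sample mean vector, written here in Python with NumPy (neither of which the article itself assumes) and using made-up data for illustration:

```python
import numpy as np

# N = 3 observations on K = 2 variables; each row is one observation vector x_i
X = np.array([[1.0, 10.0],
              [4.0, 12.0],
              [1.0,  8.0]])

N = X.shape[0]

# Sample mean vector: average each variable over the N observations
x_bar = X.sum(axis=0) / N          # same result as X.mean(axis=0)
print(x_bar)                       # [ 2. 10.]
```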


Definition of sample covariance

The sample covariance matrix is a ''K''-by-''K'' matrix \mathbf{Q} = \left[ q_{jk} \right] with entries

: q_{jk} = \frac{1}{N-1} \sum_{i=1}^{N} \left( x_{ij} - \bar{x}_j \right) \left( x_{ik} - \bar{x}_k \right),

where q_{jk} is an estimate of the covariance between the ''j''th variable and the ''k''th variable of the population underlying the data. In terms of the observation vectors, the sample covariance is

: \mathbf{Q} = \frac{1}{N-1} \sum_{i=1}^{N} (\mathbf{x}_i - \mathbf{\bar{x}}) (\mathbf{x}_i - \mathbf{\bar{x}})^\mathrm{T}.

Alternatively, arranging the observation vectors as the columns of a matrix, so that

: \mathbf{F} = \begin{bmatrix} \mathbf{x}_1 & \mathbf{x}_2 & \dots & \mathbf{x}_N \end{bmatrix},

which is a matrix of ''K'' rows and ''N'' columns, the sample covariance matrix can be computed as

: \mathbf{Q} = \frac{1}{N-1} \left( \mathbf{F} - \mathbf{\bar{x}} \, \mathbf{1}_N^\mathrm{T} \right) \left( \mathbf{F} - \mathbf{\bar{x}} \, \mathbf{1}_N^\mathrm{T} \right)^\mathrm{T},

where \mathbf{1}_N is an ''N''×1 vector of ones. If the observations are arranged as rows instead of columns, so \mathbf{\bar{x}} is now a 1×''K'' row vector and \mathbf{M} = \mathbf{F}^\mathrm{T} is an ''N''×''K'' matrix whose column ''j'' is the vector of ''N'' observations on variable ''j'', then applying transposes in the appropriate places yields

: \mathbf{Q} = \frac{1}{N-1} \left( \mathbf{M} - \mathbf{1}_N \mathbf{\bar{x}} \right)^\mathrm{T} \left( \mathbf{M} - \mathbf{1}_N \mathbf{\bar{x}} \right).

Like covariance matrices for random vectors, sample covariance matrices are positive semi-definite. To prove this, note that for any matrix \mathbf{A} the matrix \mathbf{A}^\mathrm{T} \mathbf{A} is positive semi-definite. Furthermore, a covariance matrix is positive definite if and only if the rank of the \mathbf{x}_i - \mathbf{\bar{x}} vectors is ''K''.
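As an illustrative sketch (Python/NumPy again, with made-up data and observations arranged as rows), the entry-wise formula and the matrix formula give the same result, and both agree with NumPy's built-in estimator, which also uses the ''N'' − 1 denominator by default:

```python
import numpy as np

# N = 4 observations (rows) on K = 2 variables (columns)
M = np.array([[1.0, 10.0],
              [4.0, 12.0],
              [1.0,  8.0],
              [2.0, 11.0]])
N, K = M.shape
x_bar = M.mean(axis=0)             # sample mean vector (as a row)

# Entry-wise: q_jk = 1/(N-1) * sum_i (x_ij - xbar_j)(x_ik - xbar_k)
Q_loop = np.zeros((K, K))
for j in range(K):
    for k in range(K):
        Q_loop[j, k] = ((M[:, j] - x_bar[j]) * (M[:, k] - x_bar[k])).sum() / (N - 1)

# Matrix form with observations as rows: Q = (M - 1_N xbar)^T (M - 1_N xbar) / (N - 1)
D = M - x_bar                      # broadcasting subtracts the mean from every row
Q_matrix = D.T @ D / (N - 1)

print(np.allclose(Q_loop, Q_matrix))                  # True
print(np.allclose(Q_loop, np.cov(M, rowvar=False)))   # True (np.cov defaults to N - 1)
```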


Unbiasedness

The sample mean and the sample covariance matrix are unbiased estimates of the mean and the covariance matrix of the random vector \mathbf{X}, a row vector whose ''j''th element (''j'' = 1, ..., ''K'') is one of the random variables. The sample covariance matrix has N-1 in the denominator rather than N due to a variant of Bessel's correction: in short, the sample covariance relies on the difference between each observation and the sample mean, but the sample mean is slightly correlated with each observation since it is defined in terms of all observations. If the population mean \operatorname{E}(\mathbf{X}) is known, the analogous unbiased estimate

: q_{jk} = \frac{1}{N} \sum_{i=1}^{N} \left( x_{ij} - \operatorname{E}(X_j) \right) \left( x_{ik} - \operatorname{E}(X_k) \right),

using the population mean, has N in the denominator. This is an example of why in probability and statistics it is essential to distinguish between random variables (upper case letters) and realizations of the random variables (lower case letters).

The maximum likelihood estimate of the covariance

: q_{jk} = \frac{1}{N} \sum_{i=1}^{N} \left( x_{ij} - \bar{x}_j \right) \left( x_{ik} - \bar{x}_k \right)

for the Gaussian distribution case has ''N'' in the denominator as well. The ratio of 1/''N'' to 1/(''N'' − 1) approaches 1 for large ''N'', so the maximum likelihood estimate approximately equals the unbiased estimate when the sample is large.
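The difference between the ''N'' and ''N'' − 1 denominators can be seen directly in the univariate case, e.g. via NumPy's `ddof` argument (a sketch with arbitrary illustrative data):

```python
import numpy as np

x = np.array([1.0, 4.0, 1.0, 2.0])
N = x.size

# Unbiased sample variance: divides by N - 1 (Bessel's correction)
print(x.var(ddof=1))               # equals ((x - x.mean())**2).sum() / (N - 1)

# Maximum likelihood (biased) variance: divides by N
print(x.var(ddof=0))               # equals ((x - x.mean())**2).sum() / N
```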


Distribution of the sample mean

For each random variable, the sample mean is a good estimator of the population mean, where a "good" estimator is defined as being efficient and unbiased. Of course the estimator will likely not be the true value of the population mean, since different samples drawn from the same distribution will give different sample means and hence different estimates of the true mean. Thus the sample mean is a random variable, not a constant, and consequently has its own distribution. For a random sample of ''N'' observations on the ''j''th random variable, the sample mean's distribution itself has mean equal to the population mean \operatorname{E}(X_j) and variance equal to \sigma_j^2 / N, where \sigma_j^2 is the population variance.

The arithmetic mean of a population, or population mean, is often denoted ''μ''. The sample mean \bar{x} (the arithmetic mean of a sample of values drawn from the population) makes a good estimator of the population mean, as its expected value is equal to the population mean (that is, it is an unbiased estimator). The sample mean is a random variable, not a constant, since its calculated value will randomly differ depending on which members of the population are sampled, and consequently it will have its own distribution. For a random sample of ''n'' independent observations, the expected value of the sample mean is

: \operatorname{E}(\bar{x}) = \mu

and the variance of the sample mean is

: \operatorname{Var}(\bar{x}) = \frac{\sigma^2}{n}.

If the samples are not independent, but correlated, then special care has to be taken in order to avoid the problem of pseudoreplication.

If the population is normally distributed, then the sample mean is normally distributed as follows:

: \bar{x} \sim N\left( \mu, \frac{\sigma^2}{n} \right).

If the population is not normally distributed, the sample mean is nonetheless approximately normally distributed if ''n'' is large and σ²/''n'' < +∞. This is a consequence of the central limit theorem.
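A small simulation sketch (Python/NumPy, with an arbitrarily chosen non-normal population) illustrates both claims: the variance of the sample mean is roughly σ²/''n'', and for large ''n'' the sample mean behaves approximately normally:

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 100, 10_000

# Non-normal population: exponential distribution with mean 1 and variance 1
samples = rng.exponential(scale=1.0, size=(reps, n))
means = samples.mean(axis=1)       # one sample mean per replication

print(means.mean())                # close to the population mean mu = 1
print(means.var())                 # close to sigma^2 / n = 1 / 100 = 0.01

# Roughly 95% of sample means fall within two standard errors of mu,
# consistent with the approximate normality implied by the central limit theorem
se = 1.0 / np.sqrt(n)
print(np.mean(np.abs(means - 1.0) < 2 * se))
```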


Weighted samples

In a weighted sample, each vector \mathbf{x}_i (each set of single observations on each of the ''K'' random variables) is assigned a weight w_i \geq 0. Without loss of generality, assume that the weights are normalized:

: \sum_{i=1}^{N} w_i = 1.

(If they are not, divide the weights by their sum.) Then the weighted mean vector \mathbf{\bar{x}} is given by

: \mathbf{\bar{x}} = \sum_{i=1}^{N} w_i \mathbf{x}_i,

and the elements q_{jk} of the weighted covariance matrix \mathbf{Q} are (M. Galassi et al., ''GNU Scientific Library – Reference Manual'', version 2.6, 2021)

: q_{jk} = \frac{1}{1 - \sum_{i=1}^{N} w_i^2} \sum_{i=1}^{N} w_i \left( x_{ij} - \bar{x}_j \right) \left( x_{ik} - \bar{x}_k \right).

If all weights are the same, w_i = 1/N, the weighted mean reduces to the sample mean given above, and the weighted covariance reduces to the unbiased sample covariance with ''N'' − 1 in the denominator.
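A sketch of the weighted formulas in Python/NumPy, with made-up data; the helper function `weighted_mean_cov` is introduced here only for illustration, and the final checks confirm the equal-weight reduction stated above:

```python
import numpy as np

def weighted_mean_cov(X, w):
    """Weighted sample mean vector and weighted covariance matrix.

    X : (N, K) array, observations as rows.
    w : (N,) array of non-negative weights (normalized here if they are not).
    """
    w = np.asarray(w, dtype=float)
    w = w / w.sum()                              # normalize so the weights sum to 1
    x_bar = w @ X                                # weighted mean vector
    D = X - x_bar                                # deviations from the weighted mean
    Q = (D.T * w) @ D / (1.0 - np.sum(w**2))     # q_jk = sum_i w_i d_ij d_ik / (1 - sum_i w_i^2)
    return x_bar, Q

# Made-up data: N = 4 observations on K = 2 variables
X = np.array([[1.0, 10.0],
              [4.0, 12.0],
              [1.0,  8.0],
              [2.0, 11.0]])

# Equal weights reduce to the ordinary sample mean and the unbiased (N - 1) covariance
x_bar, Q = weighted_mean_cov(X, np.ones(4))
print(np.allclose(x_bar, X.mean(axis=0)))        # True
print(np.allclose(Q, np.cov(X, rowvar=False)))   # True
```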


Criticism

The sample mean and sample covariance are not robust statistics, meaning that they are sensitive to outliers. As robustness is often a desired trait, particularly in real-world applications, robust alternatives may prove desirable, notably quantile-based statistics such as the sample median for location (Bart Kosko, "The Sample Mean", ''The World Question Center'', 2006) and the interquartile range (IQR) for dispersion. Other alternatives include trimming and Winsorising, as in the trimmed mean and the Winsorized mean.
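A brief illustration of this sensitivity (Python/NumPy, with one contrived outlier added to otherwise tame data):

```python
import numpy as np

x = np.array([1.0, 2.0, 2.0, 3.0, 4.0])
x_outlier = np.append(x, 1000.0)   # a single gross outlier

# The mean and standard deviation move dramatically...
print(x.mean(), x_outlier.mean())                  # 2.4 vs. roughly 168.7
print(x.std(ddof=1), x_outlier.std(ddof=1))

# ...while the median and interquartile range barely change
print(np.median(x), np.median(x_outlier))          # 2.0 vs. 2.5
q75, q25 = np.percentile(x_outlier, [75, 25])
print(q75 - q25)                                   # IQR of the contaminated data stays small
```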


See also

* Estimation of covariance matrices
* Scatter matrix
* Unbiased estimation of standard deviation

