In statistics, sometimes the covariance matrix of a multivariate random variable is not known but has to be estimated. Estimation of covariance matrices then deals with the question of how to approximate the actual covariance matrix on the basis of a sample from the multivariate distribution. Simple cases, where observations are complete, can be dealt with by using the sample covariance matrix. The sample covariance matrix (SCM) is an unbiased and efficient estimator of the covariance matrix if the space of covariance matrices is viewed as an extrinsic convex cone in <math>\mathbf{R}^{p\times p}</math>; however, measured using the intrinsic geometry of positive-definite matrices, the SCM is a biased and inefficient estimator. In addition, if the random variable has a normal distribution, the sample covariance matrix has a Wishart distribution and a slightly differently scaled version of it is the maximum likelihood estimate. Cases involving missing data, heteroscedasticity, or autocorrelated residuals require deeper considerations. Another issue is the robustness to outliers, to which sample covariance matrices are highly sensitive.
Statistical analyses of multivariate data often involve exploratory studies of the way in which the variables change in relation to one another, and this may be followed up by explicit statistical models involving the covariance matrix of the variables. Thus the estimation of covariance matrices directly from observational data plays two roles:
:* to provide initial estimates that can be used to study the inter-relationships;
:* to provide sample estimates that can be used for model checking.
Estimates of covariance matrices are required at the initial stages of principal component analysis and factor analysis, and are also involved in versions of regression analysis that treat the dependent variables in a data-set, jointly with the independent variable, as the outcome of a random sample.
Estimation in a general context
Given a sample consisting of ''n'' independent observations <math>x_1, \dots, x_n</math> of a ''p''-dimensional random vector <math>X \in \mathbf{R}^{p\times 1}</math> (a ''p''×1 column-vector), an unbiased estimator of the (''p''×''p'') covariance matrix
:<math>\Sigma = \operatorname{E}\left[\left(X - \operatorname{E}[X]\right)\left(X - \operatorname{E}[X]\right)^{\mathrm T}\right]</math>
is the sample covariance matrix
:<math>\mathbf{Q} = \frac{1}{n-1} \sum_{i=1}^n (x_i - \overline{x})(x_i - \overline{x})^{\mathrm T},</math>
where <math>x_i</math> is the ''i''-th observation of the ''p''-dimensional random vector, and the vector
:<math>\overline{x} = \frac{1}{n} \sum_{i=1}^n x_i</math>
is the sample mean.
This is true regardless of the distribution of the random variable ''X'', provided of course that the theoretical means and covariances exist. The reason for the factor ''n'' − 1 rather than ''n'' is essentially the same as the reason for the same factor appearing in unbiased estimates of sample variances and sample covariances, which relates to the fact that the mean is not known and is replaced by the sample mean (see Bessel's correction).
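As a concrete illustration (our own sketch, not part of the original article), the unbiased estimator above can be computed directly in NumPy and checked against <code>numpy.cov</code>, which applies the same ''n'' − 1 correction by default; the variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 3
x = rng.standard_normal((n, p))        # n observations of a p-dimensional vector

x_bar = x.mean(axis=0)                 # sample mean, shape (p,)
centered = x - x_bar
Q = centered.T @ centered / (n - 1)    # unbiased sample covariance, factor n - 1

# numpy.cov treats rows as variables by default, hence rowvar=False here
assert np.allclose(Q, np.cov(x, rowvar=False))
```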
In cases where the distribution of the random variable ''X'' is known to be within a certain family of distributions, other estimates may be derived on the basis of that assumption. A well-known instance is when the random variable ''X'' is normally distributed: in this case the maximum likelihood estimator of the covariance matrix is slightly different from the unbiased estimate, and is given by
:<math>\mathbf{Q}_n = \frac{1}{n} \sum_{i=1}^n (x_i - \overline{x})(x_i - \overline{x})^{\mathrm T}.</math>
A derivation of this result is given below. Clearly, the difference between the unbiased estimator and the maximum likelihood estimator diminishes for large ''n''.
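A short sketch (ours, assuming NumPy) contrasting the two scalings: the maximum-likelihood estimate equals (''n'' − 1)/''n'' times the unbiased one, so the two converge as ''n'' grows.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 2
x = rng.multivariate_normal(np.zeros(p), [[2.0, 0.5], [0.5, 1.0]], size=n)

centered = x - x.mean(axis=0)
sigma_unbiased = centered.T @ centered / (n - 1)   # factor n - 1
sigma_mle = centered.T @ centered / n              # factor n (Gaussian MLE)

# The two estimates differ only by the scalar (n - 1)/n, which tends to 1
assert np.allclose(sigma_mle, (n - 1) / n * sigma_unbiased)
```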
In the general case, the unbiased estimate of the covariance matrix provides an acceptable estimate when the data vectors in the observed data set are all complete: that is, they contain no missing elements. One approach to estimating the covariance matrix is to treat the estimation of each variance or pairwise covariance separately, and to use all the observations for which both variables have valid values. Assuming the missing data are missing at random, this results in an estimate for the covariance matrix which is unbiased. However, for many applications this estimate may not be acceptable because the estimated covariance matrix is not guaranteed to be positive semi-definite. This could lead to estimated correlations having absolute values which are greater than one, and/or a non-invertible covariance matrix.
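The following sketch (our own illustration, assuming NumPy; one common variant of the pairwise-complete estimator) shows why the warning matters: each entry of the estimate is computed from a different subset of observations, so the eigenvalues of the result can go negative.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 60, 4
cov_true = 0.7 * np.ones((p, p)) + 0.3 * np.eye(p)   # a valid covariance matrix
x = rng.multivariate_normal(np.zeros(p), cov_true, size=n)
x[rng.random((n, p)) < 0.35] = np.nan                # drop ~35% of entries at random

means = np.nanmean(x, axis=0)          # per-variable means over available data
Q = np.empty((p, p))
for i in range(p):
    for j in range(p):
        ok = ~np.isnan(x[:, i]) & ~np.isnan(x[:, j])  # rows where both variables are present
        Q[i, j] = ((x[ok, i] - means[i]) * (x[ok, j] - means[j])).sum() / (ok.sum() - 1)

print(np.linalg.eigvalsh(Q))   # a negative eigenvalue means Q is not positive semi-definite
```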
When estimating the cross-covariance of a pair of signals that are wide-sense stationary, missing samples do ''not'' need to be random (e.g., sub-sampling by an arbitrary factor is valid).
Maximum-likelihood estimation for the multivariate normal distribution
A random vector <math>X \in \mathbf{R}^p</math> (a ''p''×1 "column vector") has a multivariate normal distribution with a nonsingular covariance matrix Σ precisely if <math>\Sigma \in \mathbf{R}^{p\times p}</math> is a positive-definite matrix and the probability density function of ''X'' is
:<math>f(x) = (2\pi)^{-p/2} \det(\Sigma)^{-1/2} \exp\left(-\frac{1}{2}(x - \mu)^{\mathrm T} \Sigma^{-1} (x - \mu)\right),</math>
where <math>\mu \in \mathbf{R}^{p\times 1}</math> is the expected value of ''X''. The covariance matrix Σ is the multidimensional analog of what in one dimension would be the variance, and
:<math>(2\pi)^{-p/2} \det(\Sigma)^{-1/2}</math>
normalizes the density so that it integrates to 1.
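As a quick numerical check (ours, assuming SciPy is available), the density formula above can be compared against <code>scipy.stats.multivariate_normal</code>; the values are illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

p = 2
mu = np.array([1.0, -1.0])
sigma = np.array([[1.5, 0.4], [0.4, 0.8]])   # positive-definite
x = np.array([0.3, 0.2])

# f(x) = (2*pi)^(-p/2) det(Sigma)^(-1/2) exp(-(x - mu)' Sigma^{-1} (x - mu) / 2)
norm_const = (2 * np.pi) ** (-p / 2) * np.linalg.det(sigma) ** (-0.5)
quad = (x - mu) @ np.linalg.solve(sigma, x - mu)
f = norm_const * np.exp(-0.5 * quad)

assert np.isclose(f, multivariate_normal(mu, sigma).pdf(x))
```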
Suppose now that <math>X_1, \dots, X_n</math> are independent and identically distributed samples from the distribution above. Based on the observed values <math>x_1, \dots, x_n</math> of this sample, we wish to estimate Σ.
First steps
The likelihood function is:
:<math>\mathcal{L}(\mu, \Sigma) = (2\pi)^{-np/2} \det(\Sigma)^{-n/2} \exp\left(-\frac{1}{2} \sum_{i=1}^n (x_i - \mu)^{\mathrm T} \Sigma^{-1} (x_i - \mu)\right).</math>
It is fairly readily shown that the maximum-likelihood estimate of the mean vector ''μ'' is the "sample mean" vector:
:<math>\overline{x} = \frac{x_1 + \cdots + x_n}{n}.</math>
See the section on estimation in the article on the normal distribution for details; the process here is similar.
Since the estimate <math>\overline{x}</math> does not depend on Σ, we can just substitute it for ''μ'' in the likelihood function, getting
:<math>\mathcal{L}(\overline{x}, \Sigma) \propto \det(\Sigma)^{-n/2} \exp\left(-\frac{1}{2} \sum_{i=1}^n (x_i - \overline{x})^{\mathrm T} \Sigma^{-1} (x_i - \overline{x})\right),</math>
and then seek the value of Σ that maximizes the likelihood of the data (in practice it is easier to work with log <math>\mathcal{L}</math>).
The trace of a 1 × 1 matrix
Now we come to the first surprising step: regard the scalar <math>(x_i - \overline{x})^{\mathrm T} \Sigma^{-1} (x_i - \overline{x})</math> as the trace of a 1×1 matrix. This makes it possible to use the identity tr(''AB'') = tr(''BA'') whenever ''A'' and ''B'' are matrices so shaped that both products exist. We get
:<math>\begin{align}
\mathcal{L}(\overline{x}, \Sigma) &\propto \det(\Sigma)^{-n/2} \exp\left(-\frac{1}{2} \sum_{i=1}^n \operatorname{tr}\left((x_i - \overline{x})^{\mathrm T} \Sigma^{-1} (x_i - \overline{x})\right)\right) \\
&= \det(\Sigma)^{-n/2} \exp\left(-\frac{1}{2} \operatorname{tr}\left(\Sigma^{-1} \sum_{i=1}^n (x_i - \overline{x})(x_i - \overline{x})^{\mathrm T}\right)\right) \\
&= \det(\Sigma)^{-n/2} \exp\left(-\frac{1}{2} \operatorname{tr}\left(\Sigma^{-1} S\right)\right),
\end{align}</math>
where
:<math>S = \sum_{i=1}^n (x_i - \overline{x})(x_i - \overline{x})^{\mathrm T} \in \mathbf{R}^{p\times p}</math>
is sometimes called the scatter matrix, and is positive definite if there exists a subset of the data consisting of <math>p + 1</math> affinely independent observations (which we will assume).
Using the spectral theorem
It follows from the spectral theorem of linear algebra that a positive-definite symmetric matrix ''S'' has a unique positive-definite symmetric square root <math>S^{1/2}</math>. We can again use the "cyclic property" of the trace to write
:<math>\det(\Sigma)^{-n/2} \exp\left(-\frac{1}{2} \operatorname{tr}\left(S^{1/2} \Sigma^{-1} S^{1/2}\right)\right).</math>
Let <math>B = S^{1/2} \Sigma^{-1} S^{1/2}</math>. Then the expression above becomes
:<math>\det(S)^{-n/2} \det(B)^{n/2} \exp\left(-\frac{1}{2} \operatorname{tr}(B)\right).</math>
The positive-definite matrix ''B'' can be diagonalized, reducing the problem to finding the value of ''B'' that maximizes
:<math>\det(B)^{n/2} \exp\left(-\frac{1}{2} \operatorname{tr}(B)\right).</math>
Since the trace of a square matrix equals the sum of its eigenvalues ("trace and eigenvalues"), this reduces further to finding the eigenvalues <math>\lambda_1, \dots, \lambda_p</math> of ''B'' that maximize
:<math>\prod_{i=1}^p \lambda_i^{n/2} \exp\left(-\frac{\lambda_i}{2}\right).</math>
This is just a calculus problem: differentiating <math>\tfrac{n}{2} \ln \lambda_i - \tfrac{\lambda_i}{2}</math> and setting the derivative to zero gives ''λ''<sub>''i''</sub> = ''n'' for all ''i''. Thus, if ''Q'' is the matrix of eigenvectors, then
:<math>B = Q (n I_p) Q^{-1} = n I_p,</math>
i.e., ''n'' times the ''p''×''p'' identity matrix.
Concluding steps
Finally we get
:<math>\Sigma = S^{1/2} B^{-1} S^{1/2} = S^{1/2} \left(\frac{1}{n} I\right) S^{1/2} = \frac{S}{n},</math>
i.e., the ''p''×''p'' "sample covariance matrix"
:<math>\frac{S}{n} = \frac{1}{n} \sum_{i=1}^n (X_i - \overline{X})(X_i - \overline{X})^{\mathrm T}</math>
is the maximum-likelihood estimator of the "population covariance matrix" Σ. At this point we are using a capital ''X'' rather than a lower-case ''x'' because we are thinking of it "as an estimator rather than as an estimate", i.e., as something random whose probability distribution we could profit by knowing. The random matrix ''S'' can be shown to have a Wishart distribution with ''n'' − 1 degrees of freedom. That is:
:<math>\sum_{i=1}^n (X_i - \overline{X})(X_i - \overline{X})^{\mathrm T} \sim W_p(\Sigma, n - 1).</math>
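A small simulation (our own, assuming NumPy) is consistent with this: averaging the scatter matrix ''S'' over many draws recovers (''n'' − 1)Σ, which is the mean of a <math>W_p(\Sigma, n - 1)</math> distribution.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, reps = 20, 2, 5000
sigma = np.array([[1.0, 0.3], [0.3, 0.5]])

s_mean = np.zeros((p, p))
for _ in range(reps):
    x = rng.multivariate_normal(np.zeros(p), sigma, size=n)
    c = x - x.mean(axis=0)
    s_mean += c.T @ c / reps           # running average of the scatter matrix S

print(s_mean / (n - 1))                # ≈ sigma, consistent with E[S] = (n - 1) Σ
```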
Alternative derivation
An alternative derivation of the maximum likelihood estimator can be performed via matrix calculus formulae (see also differential of a determinant and differential of the inverse matrix). It also verifies the aforementioned fact about the maximum likelihood estimate of the mean. Rewrite the likelihood in the log form using the trace trick:
:<math>\ln \mathcal{L}(\mu, \Sigma) = \operatorname{const} - \frac{n}{2} \ln \det(\Sigma) - \frac{1}{2} \operatorname{tr}\left[\Sigma^{-1} \sum_{i=1}^n (x_i - \mu)(x_i - \mu)^{\mathrm T}\right].</math>
The differential of this log-likelihood is
:<math>d \ln \mathcal{L}(\mu, \Sigma) = -\frac{n}{2} \operatorname{tr}\left[\Sigma^{-1} \{d\Sigma\}\right] - \frac{1}{2} \operatorname{tr}\left[-\Sigma^{-1} \{d\Sigma\} \Sigma^{-1} \sum_{i=1}^n (x_i - \mu)(x_i - \mu)^{\mathrm T} - 2 \Sigma^{-1} \sum_{i=1}^n (x_i - \mu) \{d\mu\}^{\mathrm T}\right].</math>
It naturally breaks down into the part related to the estimation of the mean, and the part related to the estimation of the variance. The first order condition for a maximum, <math>d \ln \mathcal{L}(\mu, \Sigma) = 0</math>, is satisfied when the terms multiplying <math>d\mu</math> and <math>d\Sigma</math> are identically zero. Assuming (the maximum likelihood estimate of) <math>\Sigma</math> is non-singular, the first order condition for the estimate of the mean vector is
:<math>\sum_{i=1}^n (x_i - \mu) = 0,</math>
which leads to the maximum likelihood estimator
:<math>\widehat{\mu} = \overline{X} = \frac{1}{n} \sum_{i=1}^n X_i.</math>
This lets us simplify
:<math>\sum_{i=1}^n (x_i - \mu)(x_i - \mu)^{\mathrm T} = \sum_{i=1}^n (x_i - \overline{x})(x_i - \overline{x})^{\mathrm T} = S,</math>
as defined above. Then the terms involving <math>d\Sigma</math> in <math>d \ln \mathcal{L}</math> can be combined as
:<math>-\frac{1}{2} \operatorname{tr}\left(\Sigma^{-1} \{d\Sigma\} \left[n I_p - \Sigma^{-1} S\right]\right).</math>
The first order condition <math>d \ln \mathcal{L}(\mu, \Sigma) = 0</math> will hold when the term in the square bracket is (matrix-valued) zero. Pre-multiplying the latter by <math>\Sigma</math> and dividing by <math>n</math> gives
:<math>\widehat{\Sigma} = \frac{1}{n} S,</math>
which of course coincides with the canonical derivation given earlier.
Dwyer points out that decomposition into two terms such as appears above is "unnecessary" and derives the estimator in two lines of working. Note that it may not be trivial to show that such a derived estimator is the unique global maximizer of the likelihood function.
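As a numerical sanity check on the result (our own sketch, assuming NumPy; <code>profile_loglik</code> is a name we introduce here), the profiled log-likelihood is indeed larger at Σ = ''S''/''n'' than at the unbiased scaling ''S''/(''n'' − 1):

```python
import numpy as np

def profile_loglik(sigma, s, n):
    """log L(x_bar, Sigma) up to a constant: -(n/2) log det Sigma - (1/2) tr(Sigma^{-1} S)."""
    _, logdet = np.linalg.slogdet(sigma)
    return -0.5 * n * logdet - 0.5 * np.trace(np.linalg.solve(sigma, s))

rng = np.random.default_rng(4)
n, p = 50, 3
x = rng.standard_normal((n, p))
c = x - x.mean(axis=0)
s = c.T @ c                            # scatter matrix S

# The MLE scaling S/n scores strictly higher than the unbiased S/(n - 1)
print(profile_loglik(s / n, s, n) > profile_loglik(s / (n - 1), s, n))   # True
```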
Intrinsic covariance matrix estimation
Intrinsic expectation
Given a sample of ''n'' independent observations <math>x_1, \dots, x_n</math> of a ''p''-dimensional zero-mean Gaussian random variable ''X'' with covariance R, the maximum likelihood estimator of R is given by
:<math>\widehat{\mathbf{R}} = \frac{1}{n} \sum_{i=1}^n x_i x_i^{\mathrm T}.</math>
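A minimal sketch (ours, assuming NumPy) of this zero-mean estimator; note that there is no mean subtraction and no Bessel correction.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 1000, 3
r_true = np.array([[2.0, 0.4, 0.0],
                   [0.4, 1.0, 0.2],
                   [0.0, 0.2, 0.5]])
x = rng.multivariate_normal(np.zeros(p), r_true, size=n)   # zero-mean samples

r_hat = x.T @ x / n        # (1/n) * sum of x_i x_i^T, no mean subtraction
print(np.round(r_hat, 2))  # ≈ r_true for large n
```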
The parameter <math>\mathbf{R}</math> belongs to the set of positive-definite matrices, which is a Riemannian manifold, not a vector space, hence the usual vector-space notions of expectation, i.e. "