Hotelling's T-squared Statistic
   HOME

TheInfoList



OR:

In
statistics Statistics (from German language, German: ', "description of a State (polity), state, a country") is the discipline that concerns the collection, organization, analysis, interpretation, and presentation of data. In applying statistics to a s ...
, particularly in
hypothesis testing A statistical hypothesis test is a method of statistical inference used to decide whether the data provide sufficient evidence to reject a particular hypothesis. A statistical hypothesis test typically involves a calculation of a test statistic. T ...
, the Hotelling's ''T''-squared distribution (''T''2), proposed by
Harold Hotelling Harold Hotelling (; September 29, 1895 – December 26, 1973) was an American mathematical statistician and an influential economic theorist, known for Hotelling's law, Hotelling's lemma, and Hotelling's rule in economics, as well as Hotelling ...
, is a multivariate probability distribution that is tightly related to the ''F''-distribution and is most notable for arising as the distribution of a set of
sample statistics A statistic (singular) or sample statistic is any quantity computed from values in a sample which is considered for a statistical purpose. Statistical purposes include estimating a population parameter, describing a sample, or evaluating a hypot ...
that are natural generalizations of the statistics underlying the Student's ''t''-distribution. The Hotelling's ''t''-squared statistic (''t''2) is a generalization of Student's ''t''-statistic that is used in multivariate
hypothesis testing A statistical hypothesis test is a method of statistical inference used to decide whether the data provide sufficient evidence to reject a particular hypothesis. A statistical hypothesis test typically involves a calculation of a test statistic. T ...
.


Motivation

The distribution arises in
multivariate statistics Multivariate statistics is a subdivision of statistics encompassing the simultaneous observation and analysis of more than one outcome variable, i.e., '' multivariate random variables''. Multivariate statistics concerns understanding the differ ...
in undertaking
tests Test(s), testing, or TEST may refer to: * Test (assessment), an educational assessment intended to measure the respondents' knowledge or other abilities Arts and entertainment * ''Test'' (2013 film), an American film * ''Test'' (2014 film) ...
of the differences between the (multivariate) means of different populations, where tests for univariate problems would make use of a ''t''-test. The distribution is named for
Harold Hotelling Harold Hotelling (; September 29, 1895 – December 26, 1973) was an American mathematical statistician and an influential economic theorist, known for Hotelling's law, Hotelling's lemma, and Hotelling's rule in economics, as well as Hotelling ...
, who developed it as a generalization of Student's ''t''-distribution.


Definition

If the vector d is Gaussian multivariate-distributed with zero mean and unit
covariance matrix In probability theory and statistics, a covariance matrix (also known as auto-covariance matrix, dispersion matrix, variance matrix, or variance–covariance matrix) is a square matrix giving the covariance between each pair of elements of ...
N(\mathbf_, \mathbf_) and M is a p \times p random matrix with a
Wishart distribution In statistics, the Wishart distribution is a generalization of the gamma distribution to multiple dimensions. It is named in honor of John Wishart (statistician), John Wishart, who first formulated the distribution in 1928. Other names include Wi ...
W(\mathbf_, m) with unit
scale matrix In affine geometry, uniform scaling (or isotropic scaling) is a linear transformation that enlarges (increases) or shrinks (diminishes) objects by a '' scale factor'' that is the same in all directions ( isotropically). The result of uniform sc ...
and ''m''
degrees of freedom In many scientific fields, the degrees of freedom of a system is the number of parameters of the system that may vary independently. For example, a point in the plane has two degrees of freedom for translation: its two coordinates; a non-infinite ...
, and ''d'' and ''M'' are independent of each other, then the
quadratic form In mathematics, a quadratic form is a polynomial with terms all of degree two (" form" is another name for a homogeneous polynomial). For example, 4x^2 + 2xy - 3y^2 is a quadratic form in the variables and . The coefficients usually belong t ...
X has a Hotelling distribution (with parameters p and m): :X = m d^T M^ d \sim T^2(p, m). It can be shown that if a random variable ''X'' has Hotelling's ''T''-squared distribution, X \sim T^2_, then: : \frac X\sim F_ where F_ is the ''F''-distribution with parameters ''p'' and ''m'' − ''p'' + 1.


Hotelling ''t''-squared statistic

Let \hat be the
sample covariance The sample mean (sample average) or empirical mean (empirical average), and the sample covariance or empirical covariance are statistics computed from a sample of data on one or more random variables. The sample mean is the average value (or me ...
: : \hat = \frac 1 \sum_^n (\mathbf_i -\overline) (\mathbf_i-\overline)' where we denote
transpose In linear algebra, the transpose of a Matrix (mathematics), matrix is an operator which flips a matrix over its diagonal; that is, it switches the row and column indices of the matrix by producing another matrix, often denoted by (among other ...
by an
apostrophe The apostrophe (, ) is a punctuation mark, and sometimes a diacritical mark, in languages that use the Latin alphabet and some other alphabets. In English, the apostrophe is used for two basic purposes: * The marking of the omission of one o ...
. It can be shown that \hat is a positive (semi) definite matrix and (n-1)\hat follows a ''p''-variate
Wishart distribution In statistics, the Wishart distribution is a generalization of the gamma distribution to multiple dimensions. It is named in honor of John Wishart (statistician), John Wishart, who first formulated the distribution in 1928. Other names include Wi ...
with ''n'' − 1 degrees of freedom. The sample covariance matrix of the mean reads \hat_\overline=\hat/n. The Hotelling's ''t''-squared statistic is then defined as: : t^2=(\overline-\boldsymbol)'\hat_\overline^ (\overline-\boldsymbol)=n(\overline-\boldsymbol)'\hat^ (\overline-\boldsymbol), which is proportional to the
Mahalanobis distance The Mahalanobis distance is a distance measure, measure of the distance between a point P and a probability distribution D, introduced by Prasanta Chandra Mahalanobis, P. C. Mahalanobis in 1936. The mathematical details of Mahalanobis distance ...
between the sample mean and \boldsymbol. Because of this, one should expect the statistic to assume low values if \overline \approx \boldsymbol, and high values if they are different. From the
distribution Distribution may refer to: Mathematics *Distribution (mathematics), generalized functions used to formulate solutions of partial differential equations *Probability distribution, the probability of a particular value or value range of a varia ...
, :t^2 \sim T^2_=\frac F_ , where F_ is the ''F''-distribution with parameters ''p'' and ''n'' − ''p''. In order to calculate a ''p''-value (unrelated to ''p'' variable here), note that the distribution of t^2 equivalently implies that : \frac t^2 \sim F_ . Then, use the quantity on the left hand side to evaluate the ''p''-value corresponding to the sample, which comes from the ''F''-distribution. A
confidence region In statistics, a confidence region is a multi-dimensional generalization of a confidence interval. For a bivariate normal distribution, it is an ellipse, also known as the error ellipse. More generally, it is a set of points in an ''n''-dimension ...
may also be determined using similar logic.


Motivation

Let \mathcal_p(\boldsymbol,) denote a ''p''-variate normal distribution with
location In geography, location or place is used to denote a region (point, line, or area) on Earth's surface. The term ''location'' generally implies a higher degree of certainty than ''place'', the latter often indicating an entity with an ambiguous bou ...
\boldsymbol and known
covariance In probability theory and statistics, covariance is a measure of the joint variability of two random variables. The sign of the covariance, therefore, shows the tendency in the linear relationship between the variables. If greater values of one ...
. Let :_1,\dots,_n\sim \mathcal_p(\boldsymbol,) be ''n'' independent identically distributed (iid)
random variable A random variable (also called random quantity, aleatory variable, or stochastic variable) is a Mathematics, mathematical formalization of a quantity or object which depends on randomness, random events. The term 'random variable' in its mathema ...
s, which may be represented as p\times1 column vectors of real numbers. Define :\overline=\frac to be the
sample mean The sample mean (sample average) or empirical mean (empirical average), and the sample covariance or empirical covariance are statistics computed from a sample of data on one or more random variables. The sample mean is the average value (or me ...
with covariance _\overline=/ n. It can be shown that :(\overline-\boldsymbol)'_\overline^(\overline-\boldsymbol)\sim\chi^2_p , where \chi^2_p is the
chi-squared distribution In probability theory and statistics, the \chi^2-distribution with k Degrees of freedom (statistics), degrees of freedom is the distribution of a sum of the squares of k Independence (probability theory), independent standard normal random vari ...
with ''p'' degrees of freedom. Alternatively, one can argue using density functions and characteristic functions, as follows.


Two-sample statistic

If _1,\dots,_\sim N_p(\boldsymbol,) and _1,\dots,_\sim N_p(\boldsymbol,), with the samples independently drawn from two
independent Independent or Independents may refer to: Arts, entertainment, and media Artist groups * Independents (artist group), a group of modernist painters based in Pennsylvania, United States * Independentes (English: Independents), a Portuguese artist ...
multivariate normal distribution In probability theory and statistics, the multivariate normal distribution, multivariate Gaussian distribution, or joint normal distribution is a generalization of the one-dimensional ( univariate) normal distribution to higher dimensions. One d ...
s with the same mean and covariance, and we define :\overline=\frac\sum_^ \mathbf_i \qquad \overline=\frac\sum_^ \mathbf_i as the sample means, and :\hat_=\frac\sum_^ (\mathbf_i-\overline)(\mathbf_i-\overline)' :\hat_=\frac\sum_^ (\mathbf_i-\overline)(\mathbf_i-\overline)' as the respective sample covariance matrices. Then :\hat= \frac is the unbiased pooled covariance matrix estimate (an extension of
pooled variance In statistics, pooled variance (also known as combined variance, composite variance, or overall variance, and written \sigma^2) is a method for estimating variance of several different populations when the mean of each population may be differen ...
). Finally, the Hotelling's two-sample ''t''-squared statistic is :t^2 = \frac(\overline-\overline)'\hat^(\overline-\overline) \sim T^2(p, n_x+n_y-2)


Related concepts

It can be related to the F-distribution by :\fract^2 \sim F(p,n_x+n_y-1-p). The non-null distribution of this statistic is the noncentral F-distribution (the ratio of a non-central Chi-squared random variable and an independent central Chi-squared random variable) :\fract^2 \sim F(p,n_x+n_y-1-p;\delta), with :\delta = \frac\boldsymbol'\mathbf^\boldsymbol, where \boldsymbol=\mathbf is the difference vector between the population means. In the two-variable case, the formula simplifies nicely allowing appreciation of how the correlation, \rho, between the variables affects t^2. If we define :d_ = \overline_-\overline_, \qquad d_ = \overline_-\overline_ and :s_1 = \sqrt \qquad s_2 = \sqrt \qquad \rho = \Sigma_/(s_1 s_2) = \Sigma_/(s_1 s_2) then :t^2 = \frac \left \left ( \frac \right )^2+\left ( \frac \right )^2-2\rho \left ( \frac \right )\left ( \frac \right ) \right Thus, if the differences in the two rows of the vector \mathbf d = \overline-\overline are of the same sign, in general, t^2 becomes smaller as \rho becomes more positive. If the differences are of opposite sign t^2 becomes larger as \rho becomes more positive. A univariate special case can be found in Welch's t-test. More robust and powerful tests than Hotelling's two-sample test have been proposed in the literature, see for example the interpoint distance based tests which can be applied also when the number of variables is comparable with, or even larger than, the number of subjects.


See also

* Student's ''t''-test in univariate statistics * Student's ''t''-distribution in univariate probability theory * Multivariate Student distribution * ''F''-distribution (commonly tabulated or available in software libraries, and hence used for testing the ''T''-squared statistic using the relationship given above) * Wilks's lambda distribution (in
multivariate statistics Multivariate statistics is a subdivision of statistics encompassing the simultaneous observation and analysis of more than one outcome variable, i.e., '' multivariate random variables''. Multivariate statistics concerns understanding the differ ...
, Wilks's ''Λ'' is to Hotelling's ''T''2 as Snedecor's ''F'' is to Student's ''t'' in univariate statistics)


References


External links

* {{DEFAULTSORT:Hotelling's T-Squared Distribution Continuous distributions