Multivariate statistics is a subdivision of

statistics Statistics (from German language, German: ', "description of a State (polity), state, a country") is the discipline that concerns the collection, organization, analysis, interpretation, and presentation of data. In applying statistics to a s ...

encompassing the simultaneous observation and analysis of more than one outcome variable, i.e., ''

multivariate random variable In probability, and statistics, a multivariate random variable or random vector is a list or vector of mathematical variables each of whose value is unknown, either because the value has not yet occurred or because there is imperfect knowledge ...

s''. Multivariate statistics concerns understanding the different aims and background of each of the different forms of multivariate analysis, and how they relate to each other. The practical application of multivariate statistics to a particular problem may involve several types of univariate and multivariate analyses in order to understand the relationships between variables and their relevance to the problem being studied. In addition, multivariate statistics is concerned with multivariate

probability distribution In probability theory and statistics, a probability distribution is a Function (mathematics), function that gives the probabilities of occurrence of possible events for an Experiment (probability theory), experiment. It is a mathematical descri ...

s, in terms of both :*how these can be used to represent the distributions of observed data; :*how they can be used as part of

statistical inference Statistical inference is the process of using data analysis to infer properties of an underlying probability distribution.Upton, G., Cook, I. (2008) ''Oxford Dictionary of Statistics'', OUP. . Inferential statistical analysis infers properties of ...

, particularly where several different quantities are of interest to the same analysis. Certain types of problems involving multivariate data, for example

simple linear regression In statistics, simple linear regression (SLR) is a linear regression model with a single explanatory variable. That is, it concerns two-dimensional sample points with one independent variable and one dependent variable (conventionally, the ''x ...

and multiple regression, are ''not'' usually considered to be special cases of multivariate statistics because the analysis is dealt with by considering the (univariate) conditional distribution of a single outcome variable given the other variables.

Multivariate analysis

Multivariate analysis (MVA) is based on the principles of multivariate statistics. Typically, MVA is used to address situations where multiple measurements are made on each experimental unit and the relations among these measurements and their structures are important. A modern, overlapping categorization of MVA includes: * Normal and general multivariate models and distribution theory * The study and measurement of relationships *

Probability Probability is a branch of mathematics and statistics concerning events and numerical descriptions of how likely they are to occur. The probability of an event is a number between 0 and 1; the larger the probability, the more likely an e ...

computations of multidimensional regions * The exploration of

data structures In computer science, a data structure is a data organization and storage format that is usually chosen for efficient access to data. More precisely, a data structure is a collection of data values, the relationships among them, and the functi ...

and

patterns A pattern is a regularity in the world, in human-made design, or in abstract ideas. As such, the elements of a pattern repeat in a predictable manner. A geometric pattern is a kind of pattern formed of geometric shapes and typically repeated li ...

Multivariate analysis can be complicated by the desire to include physics-based analysis to calculate the effects of variables for a hierarchical "system-of-systems". Often, studies that wish to use multivariate analysis are stalled by the dimensionality of the problem. These concerns are often eased through the use of surrogate models, highly accurate approximations of the physics-based code. Since surrogate models take the form of an equation, they can be evaluated very quickly. This becomes an enabler for large-scale MVA studies: while a

Monte Carlo simulation Monte Carlo methods, or Monte Carlo experiments, are a broad class of computational algorithms that rely on repeated random sampling to obtain numerical results. The underlying concept is to use randomness to solve problems that might be det ...

across the design space is difficult with physics-based codes, it becomes trivial when evaluating surrogate models, which often take the form of response-surface equations.

Types of analysis

Many different models are used in MVA, each with its own type of analysis: #

Multivariate analysis of variance In statistics, multivariate analysis of variance (MANOVA) is a procedure for comparing multivariate random variable, multivariate sample means. As a multivariate procedure, it is used when there are two or more dependent variables, and is often fo ...

(MANOVA) extends the

analysis of variance Analysis of variance (ANOVA) is a family of statistical methods used to compare the Mean, means of two or more groups by analyzing variance. Specifically, ANOVA compares the amount of variation ''between'' the group means to the amount of variati ...

to cover cases where there is more than one dependent variable to be analyzed simultaneously; see also

Multivariate analysis of covariance Multivariate analysis of covariance (MANCOVA) is an extension of analysis of covariance ( ANCOVA) methods to cover cases where there is more than one dependent variable and where the control of concomitant continuous independent variables – covar ...

(MANCOVA). #Multivariate regression attempts to determine a formula that can describe how elements in a

vector Vector most often refers to: * Euclidean vector, a quantity with a magnitude and a direction * Disease vector, an agent that carries and transmits an infectious pathogen into another living organism Vector may also refer to: Mathematics a ...

of variables respond simultaneously to changes in others. For linear relations, regression analyses here are based on forms of the

general linear model The general linear model or general multivariate regression model is a compact way of simultaneously writing several multiple linear regression models. In that sense it is not a separate statistical linear model. The various multiple linear regre ...

. Some suggest that multivariate regression is distinct from multivariable regression, however, that is debated and not consistently true across scientific fields. #

Principal components analysis Principal component analysis (PCA) is a Linear map, linear dimensionality reduction technique with applications in exploratory data analysis, visualization and Data Preprocessing, data preprocessing. The data is linear map, linearly transformed ...

(PCA) creates a new set of

orthogonal In mathematics, orthogonality (mathematics), orthogonality is the generalization of the geometric notion of ''perpendicularity''. Although many authors use the two terms ''perpendicular'' and ''orthogonal'' interchangeably, the term ''perpendic ...

variables that contain the same information as the original set. It rotates the axes of variation to give a new set of orthogonal axes, ordered so that they summarize decreasing proportions of the variation. #

Factor analysis Factor analysis is a statistical method used to describe variability among observed, correlated variables in terms of a potentially lower number of unobserved variables called factors. For example, it is possible that variations in six observe ...

is similar to PCA but allows the user to extract a specified number of synthetic variables, fewer than the original set, leaving the remaining unexplained variation as error. The extracted variables are known as latent variables or factors; each one may be supposed to account for covariation in a group of observed variables. #

Canonical correlation analysis In statistics, canonical-correlation analysis (CCA), also called canonical variates analysis, is a way of inferring information from cross-covariance matrices. If we have two vectors ''X'' = (''X''1, ..., ''X'n'') and ''Y'' ...

finds linear relationships among two sets of variables; it is the generalised (i.e. canonical) version of bivariate correlation. # Redundancy analysis (RDA) is similar to canonical correlation analysis but allows the user to derive a specified number of synthetic variables from one set of (independent) variables that explain as much variance as possible in another (independent) set. It is a multivariate analogue of regression. #

Correspondence analysis Correspondence analysis (CA) is a multivariate statistical technique proposed by Herman Otto Hartley (Hirschfeld) and later developed by Jean-Paul Benzécri. It is conceptually similar to principal component analysis, but applies to categorical ...

(CA), or reciprocal averaging, finds (like PCA) a set of synthetic variables that summarise the original set. The underlying model assumes chi-squared dissimilarities among records (cases). # Canonical (or "constrained") correspondence analysis (CCA) for summarising the joint variation in two sets of variables (like redundancy analysis); combination of correspondence analysis and multivariate regression analysis. The underlying model assumes chi-squared dissimilarities among records (cases). #

Multidimensional scaling Multidimensional scaling (MDS) is a means of visualizing the level of similarity of individual cases of a data set. MDS is used to translate distances between each pair of n objects in a set into a configuration of n points mapped into an ...

comprises various algorithms to determine a set of synthetic variables that best represent the pairwise distances between records. The original method is

principal coordinates analysis Multidimensional scaling (MDS) is a means of visualizing the level of similarity of individual cases of a data set. MDS is used to translate distances between each pair of n objects in a set into a configuration of n points mapped into an ...

(PCoA; based on PCA). #

Discriminant analysis Linear discriminant analysis (LDA), normal discriminant analysis (NDA), canonical variates analysis (CVA), or discriminant function analysis is a generalization of Fisher's linear discriminant, a method used in statistics and other fields, to fi ...

, or canonical variate analysis, attempts to establish whether a set of variables can be used to distinguish between two or more groups of cases. #

Linear discriminant analysis Linear discriminant analysis (LDA), normal discriminant analysis (NDA), canonical variates analysis (CVA), or discriminant function analysis is a generalization of Fisher's linear discriminant, a method used in statistics and other fields, to fi ...

(LDA) computes a linear predictor from two sets of normally distributed data to allow for classification of new observations. # Clustering systems assign objects into groups (called clusters) so that objects (cases) from the same cluster are more similar to each other than objects from different clusters. #

Recursive partitioning Recursive partitioning is a statistics, statistical method for multivariable analysis. Recursive partitioning creates a Decision tree learning, decision tree that strives to correctly classify members of the population by splitting it into sub-popu ...

creates a decision tree that attempts to correctly classify members of the population based on a dichotomous dependent variable. #

Artificial neural networks In machine learning, a neural network (also artificial neural network or neural net, abbreviated ANN or NN) is a computational model inspired by the structure and functions of biological neural networks. A neural network consists of connected ...

extend regression and clustering methods to non-linear multivariate models. #

Statistical graphics Statistical graphics, also known as statistical graphical techniques, are graphics used in the field of statistics for data visualization. Overview Whereas statistics and data analysis procedures generally yield their output in numeric or tabul ...

such as tours, parallel coordinate plots, scatterplot matrices can be used to explore multivariate data. #

Simultaneous equations model Simultaneous equations models are a type of statistical model in which the dependent variables are functions of other dependent variables, rather than just independent variables. This means some of the explanatory variables are jointly determined ...

s involve more than one regression equation, with different dependent variables, estimated together. # Vector autoregression involves simultaneous regressions of various

time series In mathematics, a time series is a series of data points indexed (or listed or graphed) in time order. Most commonly, a time series is a sequence taken at successive equally spaced points in time. Thus it is a sequence of discrete-time data. ...

variables on their own and each other's lagged values. # Principal response curves analysis (PRC) is a method based on RDA that allows the user to focus on treatment effects over time by correcting for changes in control treatments over time. # Iconography of correlations consists in replacing a correlation matrix by a diagram where the “remarkable” correlations are represented by a solid line (positive correlation), or a dotted line (negative correlation).

Dealing with incomplete data

It is very common that in an experimentally acquired set of data the values of some components of a given data point are missing. Rather than discarding the whole data point, it is common to "fill in" values for the missing components, a process called " imputation".

Important probability distributions

There is a set of

s used in multivariate analyses that play a similar role to the corresponding set of distributions that are used in

univariate analysis Univariate is a term commonly used in statistics to describe a type of data which consists of observations on only a single characteristic or attribute. A simple example of univariate data would be the salaries of workers in industry. Like all the ...

when the

normal distribution In probability theory and statistics, a normal distribution or Gaussian distribution is a type of continuous probability distribution for a real-valued random variable. The general form of its probability density function is f(x) = \frac ...

is appropriate to a dataset. These multivariate distributions are: :*

Multivariate normal distribution In probability theory and statistics, the multivariate normal distribution, multivariate Gaussian distribution, or joint normal distribution is a generalization of the one-dimensional ( univariate) normal distribution to higher dimensions. One d ...

Wishart distribution In statistics, the Wishart distribution is a generalization of the gamma distribution to multiple dimensions. It is named in honor of John Wishart (statistician), John Wishart, who first formulated the distribution in 1928. Other names include Wi ...

:* Multivariate Student-t distribution. The

Inverse-Wishart distribution In statistics, the inverse Wishart distribution, also called the inverted Wishart distribution, is a probability distribution defined on real-valued positive-definite matrices. In Bayesian statistics it is used as the conjugate prior for the cov ...

is important in

Bayesian inference Bayesian inference ( or ) is a method of statistical inference in which Bayes' theorem is used to calculate a probability of a hypothesis, given prior evidence, and update it as more information becomes available. Fundamentally, Bayesian infer ...

, for example in Bayesian multivariate linear regression. Additionally, Hotelling's T-squared distribution is a multivariate distribution, generalising

Student's t-distribution In probability theory and statistics, Student's distribution (or simply the distribution) t_\nu is a continuous probability distribution that generalizes the Normal distribution#Standard normal distribution, standard normal distribu ...

, that is used in multivariate

hypothesis testing A statistical hypothesis test is a method of statistical inference used to decide whether the data provide sufficient evidence to reject a particular hypothesis. A statistical hypothesis test typically involves a calculation of a test statistic. T ...

History

C.R. Rao made significant contributions to multivariate statistical theory throughout his career, particularly in the mid-20th century. One of his key works is the book titled "Advanced Statistical Methods in Biometric Research," published in 1952. This work laid the foundation for many concepts in multivariate statistics. Anderson's 1958 textbook,'' An Introduction to Multivariate Statistical Analysis'', educated a generation of theorists and applied statisticians; Anderson's book emphasizes

via likelihood ratio tests and the properties of

power function In mathematics, exponentiation, denoted , is an operation involving two numbers: the ''base'', , and the ''exponent'' or ''power'', . When is a positive integer, exponentiation corresponds to repeated multiplication of the base: that is, i ...

s: admissibility,

unbiasedness In statistics, the bias of an estimator (or bias function) is the difference between this estimator's expected value and the true value of the parameter being estimated. An estimator or decision rule with zero bias is called ''unbiased''. In stat ...

and

monotonicity In mathematics, a monotonic function (or monotone function) is a function between ordered sets that preserves or reverses the given order. This concept first arose in calculus, and was later generalized to the more abstract setting of orde ...

. MVA was formerly discussed solely in the context of statistical theories, due to the size and complexity of underlying datasets and its high computational consumption. With the dramatic growth of computational power, MVA now plays an increasingly important role in data analysis and has wide application in

Omics Omics is the collective characterization and quantification of entire sets of biological molecules and the investigation of how they translate into the structure, function, and dynamics of an organism or group of organisms. The branches of scien ...

fields.

Applications

* Multivariate hypothesis testing *

Dimensionality reduction Dimensionality reduction, or dimension reduction, is the transformation of data from a high-dimensional space into a low-dimensional space so that the low-dimensional representation retains some meaningful properties of the original data, ideally ...

* Latent structure discovery * Clustering * Multivariate regression analysis * Classification and discrimination analysis * Variable selection * Multidimensional analysis *

Data mining Data mining is the process of extracting and finding patterns in massive data sets involving methods at the intersection of machine learning, statistics, and database systems. Data mining is an interdisciplinary subfield of computer science and ...

Software and tools

There are an enormous number of software packages and other tools for multivariate analysis, including: * JMP (statistical software) * MiniTab * Calc * PSPP * RCRAN
has details on the packages available for multivariate data analysis *

SAS (software) SAS (previously "Statistical Analysis System") is a statistical software suite developed by SAS Institute for data management, advanced analytics, multivariate analysis, business intelligence, and predictive analytics. SAS was developed at No ...

SciPy SciPy (pronounced "sigh pie") is a free and open-source Python library used for scientific computing and technical computing. SciPy contains modules for optimization, linear algebra, integration, interpolation, special functions, fast Fourier ...

for Python *

SPSS SPSS Statistics is a statistical software suite developed by IBM for data management, advanced analytics, multivariate analysis, business intelligence, and criminal investigation. Long produced by SPSS Inc., it was acquired by IBM in 2009. Versi ...

Stata Stata (, , alternatively , occasionally stylized as STATA) is a general-purpose Statistics, statistical software package developed by StataCorp for data manipulation, visualization, statistics, and automated reporting. It is used by researchers ...

* STATISTICA * The Unscrambler * WarpPLS * SmartPLS *

MATLAB MATLAB (an abbreviation of "MATrix LABoratory") is a proprietary multi-paradigm programming language and numeric computing environment developed by MathWorks. MATLAB allows matrix manipulations, plotting of functions and data, implementat ...

* Eviews *

NCSS (statistical software) NCSS is a statistics package produced and distributed by NCSS, LLC. Created in 1981 by Jerry L. Hintze, NCSS, LLC specializes in providing statistical analysis software to researchers, businesses, and academic institutions. It also produces PA ...

includes multivariate analysis.
The Unscrambler® X
is a multivariate analysis tool.
SIMCA
*DataPandit (Free SaaS applications b
Let's Excel Analytics Solutions

References

External links

Statnotes: Topics in Multivariate Analysis, by G. David Garson

Mike Palmer: The Ordination Web Page

InsightsNow: Makers of ReportsNow, ProfilesNow, and KnowledgeNow
{{Authority control