Multivariate statistics is a subdivision of statistics encompassing the simultaneous observation and analysis of more than one outcome variable.
Multivariate statistics concerns understanding the different aims and background of each of the different forms of multivariate analysis, and how they relate to each other. The practical application of multivariate statistics to a particular problem may involve several types of univariate and multivariate analyses in order to understand the relationships between variables and their relevance to the problem being studied.
In addition, multivariate statistics is concerned with multivariate probability distributions, in terms of both
:*how these can be used to represent the distributions of observed data;
:*how they can be used as part of statistical inference, particularly where several different quantities are of interest to the same analysis.
Certain types of problems involving multivariate data, for example simple linear regression and multiple regression, are ''not'' usually considered to be special cases of multivariate statistics because the analysis is dealt with by considering the (univariate) conditional distribution of a single outcome variable given the other variables.
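A minimal sketch of this distinction, assuming NumPy and simulated data (all variable names and values here are hypothetical): with ordinary least squares, fitting several outcome columns jointly yields the same coefficients as fitting each outcome separately, which is one way to see why the single-outcome case is treated as a univariate conditional problem.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 units, 3 explanatory variables, 2 outcome variables.
n, p, q = 200, 3, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # intercept + predictors
B_true = rng.normal(size=(p + 1, q))
Y = X @ B_true + 0.1 * rng.normal(size=(n, q))

# Multiple regression: a single outcome column given the predictors.
b_first, *_ = np.linalg.lstsq(X, Y[:, 0], rcond=None)

# Fitting all outcome columns at once gives the same least-squares
# coefficients, column by column.
B_joint, *_ = np.linalg.lstsq(X, Y, rcond=None)
same_coefficients = np.allclose(B_joint[:, 0], b_first)

# What the joint view adds is the covariance between the outcomes' residuals.
residuals = Y - X @ B_joint
residual_cov = residuals.T @ residuals / (n - X.shape[1])
```

The residual covariance matrix is the genuinely multivariate quantity in this sketch; the point estimates alone do not require a joint treatment.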
Multivariate analysis
Multivariate analysis (MVA) is based on the principles of multivariate statistics. Typically, MVA is used to address the situations where multiple measurements are made on each experimental unit and the relations among these measurements and their structures are important.
A modern, overlapping categorization of MVA includes:
* Normal and general multivariate models and distribution theory
* The study and measurement of relationships
* Probability computations of multidimensional regions
* The exploration of data structures and patterns
Multivariate analysis can be complicated by the desire to include physics-based analysis to calculate the effects of variables for a hierarchical "system-of-systems". Often, studies that wish to use multivariate analysis are stalled by the dimensionality of the problem. These concerns are often eased through the use of surrogate models, highly accurate approximations of the physics-based code. Since surrogate models take the form of an equation, they can be evaluated very quickly. This makes large-scale MVA studies feasible: while a Monte Carlo simulation across the design space is difficult with physics-based codes, it becomes trivial when evaluating surrogate models, which often take the form of response-surface equations.
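The workflow described above can be sketched as follows, assuming NumPy. The function `expensive_model` is a hypothetical stand-in for a physics-based code, and the second-order polynomial is only one possible form of response-surface equation. Because the toy model is itself quadratic, the surrogate reproduces it almost exactly; a real surrogate would only approximate the underlying code.

```python
import numpy as np

rng = np.random.default_rng(1)

def expensive_model(x1, x2):
    """Stand-in for a slow physics-based code (hypothetical)."""
    return 1.0 + 2.0 * x1 - 0.5 * x2 + 0.3 * x1 * x2 + 0.8 * x1**2

def quadratic_features(x1, x2):
    """Design matrix for a full second-order response surface in two inputs."""
    return np.column_stack([np.ones_like(x1), x1, x2, x1 * x2, x1**2, x2**2])

# 1. Run the expensive model at a small number of design points.
x1_train = rng.uniform(-1, 1, size=30)
x2_train = rng.uniform(-1, 1, size=30)
y_train = expensive_model(x1_train, x2_train)

# 2. Fit the response-surface coefficients by least squares.
coef, *_ = np.linalg.lstsq(
    quadratic_features(x1_train, x2_train), y_train, rcond=None
)

def surrogate(x1, x2):
    """Cheap approximation: evaluate the fitted response-surface equation."""
    return quadratic_features(x1, x2) @ coef

# 3. Monte Carlo over the design space using only the cheap surrogate.
x1_mc = rng.uniform(-1, 1, size=100_000)
x2_mc = rng.uniform(-1, 1, size=100_000)
y_mc = surrogate(x1_mc, x2_mc)
mc_mean, mc_std = y_mc.mean(), y_mc.std()
```

Only 30 evaluations of the expensive model are needed here; the 100,000 Monte Carlo draws touch the fitted equation alone.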
Types of analysis
There are many different models, each with its own type of analysis:
#Multivariate analysis of variance (MANOVA) extends the analysis of variance to cover cases where there is more than one dependent variable to be analyzed simultaneously; see also multivariate analysis of covariance (MANCOVA).
#Multivariate regression attempts to determine a formula that can describe how elements in a vector of variables respond simultaneously to changes in others. For linear relations, regression analyses here are based on forms of the general linear model. Some suggest that multivariate regression is distinct from multivariable regression; however, that is debated and not consistently true across scientific fields.
#Principal components analysis (PCA) creates a new set of orthogonal variables that contain the same information as the original set. It rotates the axes of variation to give a new set of orthogonal axes, ordered so that they summarize decreasing proportions of the variation.
#Factor analysis is similar to PCA but allows the user to extract a specified number of synthetic variables, fewer than the original set, leaving the remaining unexplained variation as error. The extracted variables are known as latent variables or factors; each one may be supposed to account for covariation in a group of observed variables.
#Canonical correlation analysis finds linear relationships between two sets of variables; it is the generalised (i.e. canonical) version of bivariate correlation.
#Redundancy analysis (RDA) is similar to canonical correlation analysis but allows the user to derive a specified number of synthetic variables from one set of (independent) variables that explain as much variance as possible in another (independent) set. It is a multivariate analogue of regression.
#Correspondence analysis (CA), or reciprocal averaging, finds (like PCA) a set of synthetic variables that summarise the original set. The underlying model assumes chi-squared dissimilarities among records (cases).
#Canonical (or "constrained") correspondence analysis (CCA) summarises the joint variation in two sets of variables (like redundancy analysis); it combines correspondence analysis and multivariate regression analysis. The underlying model assumes chi-squared dissimilarities among records (cases).
#Multidimensional scaling comprises various algorithms to determine a set of synthetic variables that best represent the pairwise distances between records. The original method is principal coordinates analysis (PCoA; based on PCA).
#Discriminant analysis, or canonical variate analysis, attempts to establish whether a set of variables can be used to distinguish between two or more groups of cases.
#Linear discriminant analysis (LDA) computes a linear predictor from two sets of normally distributed data to allow for classification of new observations.
#Clustering systems assign objects into groups (called clusters) so that objects (cases) from the same cluster are more similar to each other than objects from different clusters.
#Recursive partitioning creates a decision tree that attempts to correctly classify members of the population based on a dichotomous dependent variable.
#Artificial neural networks extend regression and clustering methods to non-linear multivariate models.
#Statistical graphics such as tours, parallel coordinate plots, and scatterplot matrices can be used to explore multivariate data.
#Simultaneous equations models involve more than one regression equation, with different dependent variables, estimated together.
#Vector autoregression involves simultaneous regressions of various time series variables on their own and each other's lagged values.
#Principal response curves analysis (PRC) is a method based on RDA that allows the user to focus on treatment effects over time by correcting for changes in control treatments over time.
#Iconography of correlations consists of replacing a correlation matrix with a diagram in which the “remarkable” correlations are represented by a solid line (positive correlation) or a dotted line (negative correlation).
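As an illustration of the principal components analysis entry in the list above, the following is a minimal sketch assuming NumPy and simulated data (the data-generating step is hypothetical). It uses the singular value decomposition of the centred data, which is one common way to compute the rotated axes, and is not the only one.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data: 500 observations of 3 correlated variables.
latent = rng.normal(size=(500, 2))
mixing = np.array([[2.0, 0.0, 1.0],
                   [0.0, 1.0, 0.5]])
X = latent @ mixing + 0.05 * rng.normal(size=(500, 3))

# Centre each variable, then take the singular value decomposition.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

components = Vt                  # rows are the new orthogonal axes
scores = Xc @ Vt.T               # data expressed in the rotated axes
explained_variance = s**2 / (len(X) - 1)
explained_ratio = explained_variance / explained_variance.sum()
```

The rows of `components` are orthonormal, the columns of `scores` are uncorrelated, and `explained_ratio` is in decreasing order, matching the description of axes ordered by decreasing proportions of the variation. Multiplying `scores` by `components` recovers the centred data, which is the sense in which the new variables contain the same information as the original set.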
Important probability distributions
There is a set of probability distributions used in multivariate analyses that play a similar role to the corresponding set of distributions that are used in univariate analysis when the normal distribution is appropriate to a dataset. These multivariate distributions are:
:*Multivariate normal distribution
:*Wishart distribution
:*Multivariate Student-t distribution.
The Inverse-Wishart distribution is important in Bayesian inference, for example in Bayesian multivariate linear regression. Additionally, Hotelling's T-squared distribution is a multivariate distribution, generalising Student's t-distribution, that is used in multivariate hypothesis testing.
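A short sketch of how these distributions appear in practice, assuming NumPy and a simulated bivariate normal sample (the parameter values and the null mean `mu0` are hypothetical). It computes the one-sample Hotelling's T-squared statistic and its standard rescaling to an F statistic; it does not compute a p-value.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical sample: n draws from a bivariate normal distribution.
mu_true = np.array([1.0, -0.5])
sigma_true = np.array([[1.0, 0.6],
                       [0.6, 2.0]])
n, p = 50, 2
X = rng.multivariate_normal(mu_true, sigma_true, size=n)

# Sample mean vector and unbiased sample covariance matrix.
x_bar = X.mean(axis=0)
S = np.cov(X, rowvar=False)

# One-sample Hotelling's T-squared for the null hypothesis mean == mu0.
mu0 = np.array([0.0, 0.0])
diff = x_bar - mu0
t_squared = n * diff @ np.linalg.solve(S, diff)

# Under the null hypothesis and normality, this rescaled statistic
# follows an F distribution with (p, n - p) degrees of freedom.
f_statistic = (n - p) / (p * (n - 1)) * t_squared
```

With a single variable (p = 1), `t_squared` reduces to the square of the usual one-sample t statistic, which is the sense in which the distribution generalises Student's t-distribution.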
History
Anderson's 1958 textbook, ''An Introduction to Multivariate Statistical Analysis'', educated a generation of theorists and applied statisticians; Anderson's book emphasizes hypothesis testing via likelihood ratio tests and the properties of power functions: admissibility, unbiasedness and monotonicity.
MVA was once confined largely to statistical theory because of the size and complexity of the underlying data sets and the high computational cost. With the dramatic growth of computational power, MVA now plays an increasingly important role in data analysis and has wide application in omics fields.
Applications
* Multivariate hypothesis testing
* Dimensionality reduction
* Latent structure discovery
* Clustering
* Multivariate regression analysis
* Classification and discrimination analysis
* Variable selection
* Multidimensional analysis
* Multidimensional scaling
* Data mining
Software and tools
Many software packages and other tools are available for multivariate analysis, including:
* JMP (statistical software)
* Minitab
* Calc
* PSPP
* R (CRAN has details on the packages available for multivariate data analysis)
* SAS (software)
* SciPy for Python
* SPSS
* Stata
* STATISTICA
* The Unscrambler X (a multivariate analysis tool)
* WarpPLS
* SmartPLS
* MATLAB
* EViews
* NCSS (statistical software), which includes multivariate analysis
* SIMCA
* DataPandit (free SaaS applications by Let's Excel Analytics Solutions)
See also
* Estimation of covariance matrices
* Important publications in multivariate analysis
* Multivariate testing in marketing
* Structured data analysis (statistics)
* Structural equation modeling
* RV coefficient
* Bivariate analysis
* Design of experiments (DoE)
* Dimensional analysis
* Exploratory data analysis
* Ordinary least squares (OLS)
* Partial least squares regression
* Pattern recognition
* Principal component analysis (PCA)
* Regression analysis
* Soft independent modelling of class analogies (SIMCA)
* Statistical interference
* Univariate analysis
Further reading
* A. Sen, M. Srivastava, ''Regression Analysis — Theory, Methods, and Applications'', Springer-Verlag, Berlin, 2011 (4th printing).
* Malakooti, B. (2013). Operations and Production Systems with Multiple Objectives. John Wiley & Sons.
* T. W. Anderson, ''An Introduction to Multivariate Statistical Analysis'', Wiley, New York, 1958.
* Feinstein, A. R. (1996) ''Multivariable Analysis''. New Haven, CT: Yale University Press.
* Hair, J. F. Jr. (1995) ''Multivariate Data Analysis with Readings'', 4th ed. Prentice-Hall.
* Schafer, J. L. (1997) ''Analysis of Incomplete Multivariate Data''. CRC Press. (Advanced)
* Sharma, S. (1996) ''Applied Multivariate Techniques''. Wiley. (Informal, applied)
* Izenman, Alan J. (2008). ''Modern Multivariate Statistical Techniques: Regression, Classification, and Manifold Learning''. Springer Texts in Statistics. New York: Springer-Verlag.
* "Handbook of Applied Multivariate Statistics and Mathematical Modeling". ScienceDirect. Retrieved 2019-09-03.
External links
* Statnotes: Topics in Multivariate Analysis, by G. David Garson
* Mike Palmer: The Ordination Web Page
* InsightsNow: Makers of ReportsNow, ProfilesNow, and KnowledgeNow
{{Authority control