In statistics, the two-way analysis of variance (ANOVA) is an extension of the one-way ANOVA that examines the influence of two different categorical independent variables on one continuous dependent variable. The two-way ANOVA aims to assess not only the main effect of each independent variable but also whether there is any interaction between them.
History
In 1925, Ronald Fisher mentioned the two-way ANOVA in his celebrated book, ''Statistical Methods for Research Workers'' (chapters 7 and 8). In 1934, Frank Yates published procedures for the unbalanced case. Since then, an extensive literature has been produced. The topic was reviewed in 1993 by Yasunori Fujikoshi. In 2005, Andrew Gelman proposed a different approach to ANOVA, viewed as a multilevel model.
Data set
Let us imagine a data set for which a dependent variable may be influenced by two factors which are potential sources of variation. The first factor has <math>I</math> levels (<math>i \in \{1,\ldots,I\}</math>) and the second has <math>J</math> levels (<math>j \in \{1,\ldots,J\}</math>). Each combination <math>(i,j)</math> defines a treatment, for a total of <math>I \times J</math> treatments. We represent the number of replicates for treatment <math>(i,j)</math> by <math>n_{ij}</math>, and let <math>k</math> be the index of the replicate in this treatment (<math>k \in \{1,\ldots,n_{ij}\}</math>).
From these data, we can build a contingency table, where <math>n_{i+} = \sum_{j=1}^{J} n_{ij}</math> and <math>n_{+j} = \sum_{i=1}^{I} n_{ij}</math> are the row and column totals of the replicate counts <math>n_{ij}</math>, and the total number of replicates is equal to <math>n = \sum_{i,j} n_{ij} = \sum_i n_{i+} = \sum_j n_{+j}</math>.
The experimental design is balanced if each treatment has the same number of replicates, <math>K</math>. In such a case, the design is also said to be orthogonal, which allows the effects of both factors to be fully distinguished. We can hence write <math>\forall i,j \;\; n_{ij} = K</math>, and <math>\forall i,j \;\; n_{ij} = \frac{n_{i+} \cdot n_{+j}}{n}</math>.
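The bookkeeping described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`cell_counts`, `margins`, `is_balanced`, `is_proportional`), the level labels and the observations are all made up for the example, and <math>n_{ij}</math> denotes the number of replicates in the cell at row ''i'' and column ''j''.

```python
from collections import Counter


def cell_counts(observations, levels_a, levels_b):
    """Return the I x J table of replicate counts n_ij.

    `observations` is an iterable of (level_a, level_b, value) triples.
    """
    counts = Counter((a, b) for a, b, _ in observations)
    return [[counts[(a, b)] for b in levels_b] for a in levels_a]


def margins(n):
    """Return row totals n_{i+}, column totals n_{+j} and the grand total n."""
    row_totals = [sum(row) for row in n]
    col_totals = [sum(col) for col in zip(*n)]
    return row_totals, col_totals, sum(row_totals)


def is_balanced(n):
    """True if every treatment has the same number of replicates."""
    return len({n_ij for row in n for n_ij in row}) == 1


def is_proportional(n):
    """True if n_ij * n == n_{i+} * n_{+j} in every cell (integer arithmetic)."""
    row_totals, col_totals, total = margins(n)
    return all(
        n[i][j] * total == row_totals[i] * col_totals[j]
        for i in range(len(n))
        for j in range(len(n[0]))
    )


# Made-up observations: factor A has 2 levels, factor B has 3, 2 replicates each.
observations = [
    ("a1", "b1", 4.1), ("a1", "b1", 3.9),
    ("a1", "b2", 5.0), ("a1", "b2", 5.4),
    ("a1", "b3", 6.2), ("a1", "b3", 5.8),
    ("a2", "b1", 4.8), ("a2", "b1", 5.1),
    ("a2", "b2", 6.0), ("a2", "b2", 6.3),
    ("a2", "b3", 7.9), ("a2", "b3", 8.2),
]
n = cell_counts(observations, ["a1", "a2"], ["b1", "b2", "b3"])
print(n, margins(n), is_balanced(n), is_proportional(n))
```

A count table such as `[[1, 2], [2, 4]]` passes `is_proportional` without passing `is_balanced`, which shows that the proportionality condition can hold even when the cell counts are not all equal.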
Model
Upon observing variation among all <math>n</math> data points, for instance via a histogram, "probability may be used to describe such variation". Let us hence denote by <math>Y_{ijk}</math> the random variable whose observed value <math>y_{ijk}</math> is the <math>k</math>-th measure for treatment <math>(i,j)</math>. The two-way ANOVA models all these variables as varying independently and normally around a mean, <math>\mu_{ij}</math>, with a constant variance, <math>\sigma^2</math> (homoscedasticity):
<math>Y_{ijk} \mid \mu_{ij}, \sigma^2 \; \overset{\text{ind.}}{\sim} \; \mathcal{N}(\mu_{ij}, \sigma^2)</math>.
Specifically, the mean of the response variable is modeled as a linear combination of the explanatory variables:
<math>\mu_{ij} = \mu + \alpha_i + \beta_j + \gamma_{ij}</math>,
where <math>\mu</math> is the grand mean, <math>\alpha_i</math> is the additive main effect of level <math>i</math> from the first factor (''i''-th row in the contingency table), <math>\beta_j</math> is the additive main effect of level <math>j</math> from the second factor (''j''-th column in the contingency table) and <math>\gamma_{ij}</math> is the non-additive interaction effect of treatment <math>(i,j)</math> for samples <math>k = 1, \ldots, n_{ij}</math> from both factors (cell at row ''i'' and column ''j'' in the contingency table).
Another equivalent way of describing the two-way ANOVA is to note that, besides the variation explained by the factors, there remains some statistical noise. This amount of unexplained variation is handled via the introduction of one random variable per data point, <math>\epsilon_{ijk}</math>, called error. These <math>n</math> random variables are seen as deviations from the means, and are assumed to be independent and normally distributed:
<math>Y_{ijk} = \mu_{ij} + \epsilon_{ijk} \text{ with } \epsilon_{ijk} \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^2)</math>.
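To make the model concrete, the following sketch draws artificial data from it: each observation is the sum of a grand mean, two main effects, an interaction effect and an independent normal error with common standard deviation. The function name `simulate_two_way`, its arguments and all parameter values are assumptions chosen for illustration only.

```python
import random


def simulate_two_way(mu, alpha, beta, gamma, sigma, n_rep, seed=0):
    """Simulate y[i][j][k] = mu + alpha[i] + beta[j] + gamma[i][j] + error.

    Errors are independent draws from a normal distribution with mean 0 and
    standard deviation `sigma`; every treatment gets `n_rep` replicates.
    """
    rng = random.Random(seed)
    y = []
    for i in range(len(alpha)):
        row = []
        for j in range(len(beta)):
            mean_ij = mu + alpha[i] + beta[j] + gamma[i][j]
            row.append([mean_ij + rng.gauss(0.0, sigma) for _ in range(n_rep)])
        y.append(row)
    return y


# Made-up parameter values (2 levels for the first factor, 3 for the second).
mu = 10.0
alpha = [-1.0, 1.0]
beta = [-2.0, 0.0, 2.0]
gamma = [[0.5, -0.5, 0.0], [-0.5, 0.5, 0.0]]
y = simulate_two_way(mu, alpha, beta, gamma, sigma=1.0, n_rep=4)
```

Setting `sigma=0.0` removes the noise, so every replicate equals its treatment mean; this is a convenient way to check that the nested layout `y[i][j][k]` is indexed as intended.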
Assumptions
Following Gelman and Hill, the assumptions of the ANOVA, and more generally the general linear model, are, in decreasing order of importance:
# the data points are relevant with respect to the scientific question under investigation;
# the mean of the response variable is influenced additively (if there is no interaction term) and linearly by the factors;
# the errors are independent;
# the errors have the same variance;
# the errors are normally distributed.
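The last three assumptions concern the errors, which are not observed directly. A common informal check, sketched below, looks instead at the residuals of the model with interaction, that is, the deviation of each observation from the mean of its own treatment, and at the sample variance within each treatment. This is a rough diagnostic rather than a formal test; the function names and the data are made up for the example.

```python
from statistics import fmean, variance


def residuals(y):
    """Return y_ijk minus the mean of cell (i, j), in the same nested layout as y."""
    out = []
    for row in y:
        out_row = []
        for cell in row:
            cell_mean = fmean(cell)
            out_row.append([value - cell_mean for value in cell])
        out.append(out_row)
    return out


def cell_variances(y):
    """Sample variance within each cell; needs at least two replicates per cell."""
    return [[variance(cell) for cell in row] for row in y]


# Made-up balanced data: y[i][j] holds the replicates of treatment (i, j).
y = [
    [[1.0, 3.0], [5.0, 7.0]],
    [[2.0, 4.0], [10.0, 12.0]],
]
print(residuals(y))
print(cell_variances(y))
```

Within-treatment variances of very different sizes would cast doubt on the equal-variance assumption, and a histogram of the pooled residuals gives a first impression of normality.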
Parameter estimation
To ensure identifiability of parameters, we can add the following "sum-to-zero" constraints:
<math>\sum_i \alpha_i = \sum_j \beta_j = \sum_i \gamma_{ij} = \sum_j \gamma_{ij} = 0</math>
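For a balanced design, the least-squares estimates under sum-to-zero constraints reduce to differences between treatment means, row means, column means and the grand mean. The sketch below assumes a balanced design with the data stored as `y[i][j]`, a list of replicates for the treatment at row ''i'' and column ''j''; the function name `estimate_effects` and the data are hypothetical, and the unbalanced case is not covered.

```python
from statistics import fmean


def estimate_effects(y):
    """Estimate (mu, alpha, beta, gamma) for a balanced two-way layout.

    Uses mu = grand mean, alpha_i = row mean - mu, beta_j = column mean - mu,
    gamma_ij = cell mean - row mean - column mean + mu.
    """
    n_rows, n_cols = len(y), len(y[0])
    cell = [[fmean(y[i][j]) for j in range(n_cols)] for i in range(n_rows)]
    row = [fmean(cell[i]) for i in range(n_rows)]
    col = [fmean([cell[i][j] for i in range(n_rows)]) for j in range(n_cols)]
    mu = fmean(row)
    alpha = [r - mu for r in row]
    beta = [c - mu for c in col]
    gamma = [
        [cell[i][j] - row[i] - col[j] + mu for j in range(n_cols)]
        for i in range(n_rows)
    ]
    return mu, alpha, beta, gamma


# Made-up, noise-free data built from mu = 10, alpha = (-1, 1),
# beta = (-2, 0, 2) and gamma = ((0.5, -0.5, 0), (-0.5, 0.5, 0)).
y = [
    [[7.5, 7.5], [8.5, 8.5], [11.0, 11.0]],
    [[8.5, 8.5], [11.5, 11.5], [13.0, 13.0]],
]
mu_hat, alpha_hat, beta_hat, gamma_hat = estimate_effects(y)
```

By construction the estimated main effects sum to zero, and the estimated interaction effects sum to zero along every row and every column, so the estimates satisfy the same constraints as the parameters.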
Hypothesis testing
In the classical approach, testing null hypotheses (that the factors have no effect) is achieved via their significance, which requires calculating sums of squares. Testing whether the interaction term is significant can be difficult because of the potentially large number of degrees of freedom.
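As a minimal sketch of the classical computation, the code below partitions the total sum of squares of a balanced data set into contributions from the first factor (`A`), the second factor (`B`), their interaction (`AB`) and the error, and forms each F statistic as the ratio of an effect mean square to the error mean square. It assumes a balanced design with at least two replicates per treatment; the function name, the labels and the data are made up, and p-values (which require the F distribution) are left out to keep the example within the standard library.

```python
from statistics import fmean


def two_way_anova_balanced(y):
    """Sums of squares, degrees of freedom and F statistics for balanced data.

    y[i][j] is the list of K replicates of treatment (i, j), with K >= 2.
    """
    n_a, n_b, k = len(y), len(y[0]), len(y[0][0])
    cell = [[fmean(y[i][j]) for j in range(n_b)] for i in range(n_a)]
    row = [fmean(cell[i]) for i in range(n_a)]
    col = [fmean([cell[i][j] for i in range(n_a)]) for j in range(n_b)]
    grand = fmean(row)

    ss = {
        "A": n_b * k * sum((r - grand) ** 2 for r in row),
        "B": n_a * k * sum((c - grand) ** 2 for c in col),
        "AB": k * sum(
            (cell[i][j] - row[i] - col[j] + grand) ** 2
            for i in range(n_a) for j in range(n_b)
        ),
        "error": sum(
            (v - cell[i][j]) ** 2
            for i in range(n_a) for j in range(n_b) for v in y[i][j]
        ),
    }
    ss["total"] = sum(
        (v - grand) ** 2 for rows in y for values in rows for v in values
    )
    df = {
        "A": n_a - 1,
        "B": n_b - 1,
        "AB": (n_a - 1) * (n_b - 1),
        "error": n_a * n_b * (k - 1),
    }
    ms_error = ss["error"] / df["error"]
    f_stat = {name: (ss[name] / df[name]) / ms_error for name in ("A", "B", "AB")}
    return ss, df, f_stat


# Made-up balanced data: 2 x 2 treatments with 2 replicates each.
y = [
    [[1.0, 3.0], [5.0, 7.0]],
    [[2.0, 4.0], [10.0, 12.0]],
]
ss, df, f_stat = two_way_anova_balanced(y)
print(ss, df, f_stat)
```

In the balanced case the four component sums of squares add up to the total, which provides a quick sanity check on the arithmetic. Each F statistic would then be compared with an F distribution whose degrees of freedom are those of the effect and of the error.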
Example
The following hypothetical example gives the yields of 15 plants subject to two different environmental variations, and three different fertilisers.
Five sums of squares are calculated:
Finally, the sums of squared deviations required for the analysis of variance can be calculated.
See also
* Analysis of variance
* F test (''Includes a one-way ANOVA example'')
* Mixed model
* Multivariate analysis of variance (MANOVA)
* One-way ANOVA
* Repeated measures ANOVA
* Tukey's test of additivity
References
* {{cite book |author=George Casella |author-link=George Casella |date=18 April 2008 |title=Statistical design |series=Springer Texts in Statistics |publisher=Springer |isbn=978-0-387-75965-4 |url=https://www.springer.com/statistics/statistical+theory+and+methods/book/978-0-387-75964-7}}