In statistics, the two-way analysis of variance (ANOVA) is an extension of the one-way ANOVA that examines the influence of two different categorical independent variables on one continuous dependent variable. The two-way ANOVA aims to assess not only the main effect of each independent variable but also whether there is any interaction between them.

History

In 1925, Ronald Fisher mentioned the two-way ANOVA in his celebrated book, ''Statistical Methods for Research Workers'' (chapters 7 and 8). In 1934, Frank Yates published procedures for the unbalanced case. Since then, an extensive literature has been produced. The topic was reviewed in 1993 by Yasunori Fujikoshi. In 2005, Andrew Gelman proposed a different approach to ANOVA, viewed as a multilevel model.

Data set

Let us imagine a data set for which a dependent variable may be influenced by two factors which are potential sources of variation. The first factor has $I$ levels, indexed by $i = 1, \dots, I$, and the second has $J$ levels, indexed by $j = 1, \dots, J$. Each combination $(i,j)$ defines a treatment, for a total of $I \times J$ treatments. We denote the number of replicates for treatment $(i,j)$ by $n_{ij}$, and let $k = 1, \dots, n_{ij}$ be the index of the replicate in this treatment.

From these data, we can build a contingency table, where $n_{i+} = \sum_{j=1}^J n_{ij}$ and $n_{+j} = \sum_{i=1}^I n_{ij}$, and the total number of replicates is equal to $n = \sum_{i,j} n_{ij} = \sum_i n_{i+} = \sum_j n_{+j}$.

The experimental design is balanced if each treatment has the same number of replicates, $K$. In such a case, the design is also said to be orthogonal, which makes it possible to fully distinguish the effects of both factors. We can hence write $\forall i,j \; n_{ij} = K$, and $\forall i,j \; n_{ij} = \frac{n_{i+} \cdot n_{+j}}{n}$.
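The bookkeeping described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the observations are made up, the helper names `count_table`, `is_balanced` and `is_orthogonal` are hypothetical, and levels are indexed from 0 rather than from 1.

```python
from collections import Counter


def count_table(observations, I, J):
    """Return cell counts n_ij, row margins n_i+, column margins n_+j and total n."""
    counts = Counter((i, j) for i, j, _ in observations)
    n_ij = [[counts[(i, j)] for j in range(J)] for i in range(I)]
    n_row = [sum(row) for row in n_ij]
    n_col = [sum(n_ij[i][j] for i in range(I)) for j in range(J)]
    n = sum(n_row)
    return n_ij, n_row, n_col, n


def is_balanced(n_ij):
    """Balanced design: every treatment has the same number of replicates K."""
    return len({c for row in n_ij for c in row}) == 1


def is_orthogonal(n_ij, n_row, n_col, n):
    """Check n_ij = n_i+ * n_+j / n for all cells (cross-multiplied to stay in integers)."""
    return all(
        n_ij[i][j] * n == n_row[i] * n_col[j]
        for i in range(len(n_row))
        for j in range(len(n_col))
    )


# Hypothetical data: (level of factor 1, level of factor 2, measured value)
observations = [
    (0, 0, 4.1), (0, 0, 3.9), (0, 1, 5.0), (0, 1, 5.2), (0, 2, 6.1), (0, 2, 5.8),
    (1, 0, 4.4), (1, 0, 4.6), (1, 1, 5.9), (1, 1, 6.3), (1, 2, 7.0), (1, 2, 7.4),
]
n_ij, n_row, n_col, n = count_table(observations, I=2, J=3)
print(n_ij, n_row, n_col, n)
print(is_balanced(n_ij), is_orthogonal(n_ij, n_row, n_col, n))
```

Here every treatment has two replicates, so the design is balanced and the orthogonality condition holds cell by cell.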

Model

Upon observing variation among all $n$ data points, for instance via a histogram, "probability may be used to describe such variation". Let us hence denote by $Y_{ijk}$ the random variable whose observed value $y_{ijk}$ is the $k$-th measure for treatment $(i,j)$. The two-way ANOVA models all these variables as varying independently and normally around a mean, $\mu_{ij}$, with a constant variance, $\sigma^2$ (homoscedasticity):

$$Y_{ijk} \, | \, \mu_{ij}, \sigma^2 \; \overset{\text{ind.}}{\sim} \; \mathcal{N}(\mu_{ij}, \sigma^2)$$

Specifically, the mean of the response variable is modeled as a linear combination of the explanatory variables:

$$\mu_{ij} = \mu + \alpha_i + \beta_j + \gamma_{ij}$$

where $\mu$ is the grand mean, $\alpha_i$ is the additive main effect of level $i$ from the first factor (''i''-th row in the contingency table), $\beta_j$ is the additive main effect of level $j$ from the second factor (''j''-th column in the contingency table) and $\gamma_{ij}$ is the non-additive interaction effect of treatment $(i,j)$ for samples $k = 1, \dots, n_{ij}$ from both factors (cell at row ''i'' and column ''j'' in the contingency table).

An equivalent way of describing the two-way ANOVA is to say that, besides the variation explained by the factors, there remains some statistical noise. This unexplained variation is handled by introducing one random variable per data point, $\epsilon_{ijk}$, called the error. These $n$ random variables are seen as deviations from the means, and are assumed to be independent and normally distributed:

$$Y_{ijk} = \mu_{ij} + \epsilon_{ijk} \text{ with } \epsilon_{ijk} \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^2)$$
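To make the generative reading of the model concrete, the following sketch draws data from $Y_{ijk} = \mu + \alpha_i + \beta_j + \gamma_{ij} + \epsilon_{ijk}$ for a balanced design. It is an illustration only: the parameter values are invented, the function name `simulate_two_way` is hypothetical, and levels are indexed from 0.

```python
import random


def simulate_two_way(mu, alpha, beta, gamma, K, sigma, seed=0):
    """Draw K replicates per treatment from the two-way ANOVA model.

    Returns a dict mapping (i, j) to the list of simulated values y_ijk.
    """
    rng = random.Random(seed)  # local generator so results are reproducible
    data = {}
    for i, a in enumerate(alpha):
        for j, b in enumerate(beta):
            mean_ij = mu + a + b + gamma[i][j]  # cell mean mu_ij
            # Each observation is the cell mean plus an independent N(0, sigma^2) error.
            data[(i, j)] = [mean_ij + rng.gauss(0.0, sigma) for _ in range(K)]
    return data


# Hypothetical parameter values (chosen to satisfy sum-to-zero constraints)
mu = 10.0
alpha = [-1.0, 1.0]
beta = [-2.0, 0.0, 2.0]
gamma = [[0.5, 0.0, -0.5], [-0.5, 0.0, 0.5]]

data = simulate_two_way(mu, alpha, beta, gamma, K=4, sigma=1.0)
for treatment, values in sorted(data.items()):
    print(treatment, [round(v, 2) for v in values])
```

Setting `sigma=0.0` returns the cell means themselves, which is a convenient way to check that the linear combination is assembled as intended.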

Assumptions

Following Gelman and Hill, the assumptions of the ANOVA, and more generally of the general linear model, are, in decreasing order of importance:
# the data points are relevant with respect to the scientific question under investigation;
# the mean of the response variable is influenced additively (in the absence of an interaction term) and linearly by the factors;
# the errors are independent;
# the errors have the same variance;
# the errors are normally distributed.
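The last two assumptions concern the errors, which can be inspected informally through residuals. The sketch below is not taken from Gelman and Hill: it assumes a model with an interaction term, so that residuals are deviations from cell means, and it uses the ratio of the largest to the smallest cell variance as a rough, informal indicator rather than a formal test. The data and the name `residual_diagnostics` are hypothetical.

```python
from statistics import fmean, variance


def residual_diagnostics(y):
    """y[i][j] is the list of replicates for treatment (i, j), at least two per cell.

    Returns the residuals, the per-cell sample variances, and the ratio of the
    largest to the smallest cell variance.
    """
    residuals = []
    cell_variances = []
    for row in y:
        for cell in row:
            m = fmean(cell)
            residuals.extend(v - m for v in cell)  # deviations from the cell mean
            cell_variances.append(variance(cell))  # unbiased sample variance
    smallest = min(cell_variances)
    ratio = max(cell_variances) / smallest if smallest > 0 else float("inf")
    return residuals, cell_variances, ratio


# Hypothetical data: 2 x 2 design with three replicates per treatment
y = [
    [[4.0, 6.0, 5.0], [7.0, 9.0, 8.0]],
    [[6.0, 8.0, 10.0], [11.0, 13.0, 12.0]],
]
residuals, cell_variances, ratio = residual_diagnostics(y)
print(cell_variances, ratio)
```

A ratio far from 1 suggests looking more closely at the equal-variance assumption; the residuals themselves can be plotted, for instance as a histogram, to judge normality.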

Parameter estimation

To ensure identifiability of parameters, we can add the following "sum-to-zero" constraints:

$$\sum_i \alpha_i = \sum_j \beta_j = \sum_i \gamma_{ij} = \sum_j \gamma_{ij} = 0$$
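As an illustration of how these constraints pin down the parameters, the sketch below computes the usual least-squares estimates for a balanced design: the grand mean for $\mu$, row and column mean deviations for $\alpha_i$ and $\beta_j$, and the remaining cell deviation for $\gamma_{ij}$. The balanced design is an assumption of this sketch, and the data and the function name `estimate_effects` are hypothetical.

```python
from statistics import fmean


def estimate_effects(y):
    """Estimate mu, alpha, beta, gamma under sum-to-zero constraints.

    y[i][j] is the list of K replicates for treatment (i, j); a balanced
    design (same K in every cell) is assumed.
    """
    I, J = len(y), len(y[0])
    cell = [[fmean(y[i][j]) for j in range(J)] for i in range(I)]
    row = [fmean(cell[i]) for i in range(I)]
    col = [fmean([cell[i][j] for i in range(I)]) for j in range(J)]
    mu = fmean(row)
    alpha = [r - mu for r in row]
    beta = [c - mu for c in col]
    gamma = [
        [cell[i][j] - row[i] - col[j] + mu for j in range(J)]
        for i in range(I)
    ]
    return mu, alpha, beta, gamma


# Hypothetical data: 2 x 3 design with K = 2 replicates per treatment
y = [
    [[4.0, 6.0], [7.0, 9.0], [10.0, 12.0]],
    [[6.0, 8.0], [11.0, 13.0], [16.0, 18.0]],
]
mu_hat, alpha_hat, beta_hat, gamma_hat = estimate_effects(y)
print(mu_hat, alpha_hat, beta_hat, gamma_hat)
```

By construction the estimated effects sum to zero over each index, and $\hat\mu + \hat\alpha_i + \hat\beta_j + \hat\gamma_{ij}$ reproduces each cell mean.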

Hypothesis testing

In the classical approach, null hypotheses (that the factors have no effect) are tested by assessing their significance, which requires calculating sums of squares. Testing whether the interaction term is significant can be difficult because of the potentially large number of degrees of freedom.
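The sums of squares involved can be sketched as follows for the balanced case with $K \geq 2$ replicates per treatment. This is a minimal illustration, not a general implementation: unbalanced designs need other procedures, the data are made up, and the function name `two_way_anova_balanced` is hypothetical. Each F statistic is the mean square of an effect divided by the error mean square.

```python
from statistics import fmean


def two_way_anova_balanced(y):
    """Sums of squares, degrees of freedom and F statistics (balanced design).

    y[i][j] is the list of K >= 2 replicates for treatment (i, j).
    """
    I, J, K = len(y), len(y[0]), len(y[0][0])
    cell = [[fmean(y[i][j]) for j in range(J)] for i in range(I)]
    row = [fmean(cell[i]) for i in range(I)]
    col = [fmean([cell[i][j] for i in range(I)]) for j in range(J)]
    grand = fmean(row)
    cells = [(i, j) for i in range(I) for j in range(J)]

    ss = {
        "A": J * K * sum((r - grand) ** 2 for r in row),
        "B": I * K * sum((c - grand) ** 2 for c in col),
        "AB": K * sum((cell[i][j] - row[i] - col[j] + grand) ** 2 for i, j in cells),
        "error": sum((v - cell[i][j]) ** 2 for i, j in cells for v in y[i][j]),
        "total": sum((v - grand) ** 2 for i, j in cells for v in y[i][j]),
    }
    df = {"A": I - 1, "B": J - 1, "AB": (I - 1) * (J - 1), "error": I * J * (K - 1)}
    ms_error = ss["error"] / df["error"]
    f_stat = {name: (ss[name] / df[name]) / ms_error for name in ("A", "B", "AB")}
    return ss, df, f_stat


# Hypothetical data: 2 x 3 design with K = 2 replicates per treatment
y = [
    [[4.0, 6.0], [7.0, 9.0], [10.0, 12.0]],
    [[6.0, 8.0], [11.0, 13.0], [16.0, 18.0]],
]
ss, df, f_stat = two_way_anova_balanced(y)
print(ss)
print(df)
print(f_stat)
```

In a balanced design the four component sums of squares add up to the total. A p-value for each effect would then come from the F distribution with the effect's and the error's degrees of freedom, for example via `scipy.stats.f.sf` if SciPy is available.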

Example

The following hypothetical example gives the yields of 15 plants subject to two different environmental variations and three different fertilisers. Five sums of squares are calculated. Finally, the sums of squared deviations required for the analysis of variance can be calculated.

References

* George Casella (18 April 2008). ''Statistical design''. Springer Texts in Statistics. Springer. ISBN 978-0-387-75965-4. https://www.springer.com/statistics/statistical+theory+and+methods/book/978-0-387-75964-7