Fisher's exact test (also Fisher-Irwin test) is a statistical significance test used in the analysis of contingency tables. Although in practice it is employed when sample sizes are small, it is valid for all sample sizes. The test assumes that all row and column sums of the contingency table were fixed by design and tends to be conservative and underpowered outside of this setting.
It is one of a class of exact tests, so called because the significance of the deviation from a null hypothesis (e.g., ''p''-value) can be calculated exactly, rather than relying on an approximation that becomes exact in the limit as the sample size grows to infinity, as with many statistical tests.
The test is named after its inventor, Ronald Fisher, who is said to have devised the test following a comment from Muriel Bristol, who claimed to be able to detect whether the tea or the milk was added first to her cup. He tested her claim in the "lady tasting tea" experiment.
Purpose and scope

The test is useful for categorical data that result from classifying objects in two different ways; it is used to examine the significance of the association (contingency) between the two kinds of classification. So in Fisher's original example, one criterion of classification could be whether milk or tea was put in the cup first; the other could be whether Bristol thinks that the milk or tea was put in first. We want to know whether these two classifications are associated—that is, whether Bristol really can tell whether milk or tea was poured in first. Most uses of the Fisher test involve, like this example, a 2 × 2 contingency table (discussed below). The ''p''-value from the test is computed as if the margins of the table are fixed, i.e. as if, in the tea-tasting example, Bristol knows the number of cups with each treatment (milk or tea first) and will therefore provide guesses with the correct number in each category. As pointed out by Fisher, this leads under a null hypothesis of independence to a hypergeometric distribution of the numbers in the cells of the table.
of the numbers in the cells of the table. This setting is however rare in scientific practice and the test is conservative, when one or both margins are random variables themselves
With large samples, a chi-squared test (or better yet, a G-test) can be used in this situation. However, the significance value it provides is only an approximation, because the sampling distribution
of the test statistic that is calculated is only approximately equal to the theoretical chi-squared distribution. The approximation is poor when sample sizes are small, or the data are very unequally distributed among the cells of the table, resulting in the cell counts predicted on the null hypothesis (the "expected values") being low. The usual rule for deciding whether the chi-squared approximation is good enough is that the chi-squared test is not suitable when the expected values in any of the cells of a contingency table are below 5, or below 10 when there is only one
degree of freedom (this rule is now known to be overly conservative). In fact, for small, sparse, or unbalanced data, the exact and asymptotic ''p''-values can be quite different and may lead to opposite conclusions concerning the hypothesis of interest.
[Mehta, C. R. 1995. SPSS 6.1 Exact test for Windows. Englewood Cliffs, NJ: Prentice Hall.] In contrast, the Fisher exact test is, as its name states, exact as long as the experimental procedure keeps the row and column totals fixed, and it can therefore be used regardless of the sample characteristics. It becomes difficult to calculate with large samples or well-balanced tables, but fortunately these are exactly the conditions where the chi-squared test is appropriate.
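As a concrete illustration of this difference, the following Python sketch (assuming SciPy is available, and using invented counts whose expected values all fall below 5) compares the exact ''p''-value with the chi-squared approximation:

    # A sketch comparing the exact and asymptotic p-values on a small,
    # hypothetical 2 x 2 table whose expected counts all fall below 5.
    from scipy import stats

    table = [[1, 5], [7, 2]]  # invented counts for illustration

    odds_ratio, p_exact = stats.fisher_exact(table)   # exact, two-sided
    chi2, p_chi2, dof, expected = stats.chi2_contingency(table, correction=False)

    print("Fisher exact p:", p_exact)
    print("chi-squared approximation p:", p_chi2)
    print("expected counts under independence:\n", expected)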
For hand calculations, the test is feasible only in the case of a 2 × 2 contingency table. However, the principle of the test can be extended to the general case of an ''m'' × ''n'' table, and some statistical packages provide a calculation (sometimes using a Monte Carlo method to obtain an approximation) for the more general case.
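As a rough sketch of how such a Monte Carlo approximation can work (the counts below are invented, and the permutation scheme shown is one common way of holding both margins fixed; it is not any specific package's algorithm):

    # Monte Carlo sketch for an m x n table: permute one classification so that
    # both margins stay fixed, and estimate the p-value as the fraction of
    # simulated tables whose conditional probability is no larger than that of
    # the observed table.
    import numpy as np
    from math import lgamma

    def log_table_prob(t):
        # log of the probability of a table conditional on its margins
        t = np.asarray(t)
        n = t.sum()
        return (sum(lgamma(r + 1) for r in t.sum(axis=1))
                + sum(lgamma(c + 1) for c in t.sum(axis=0))
                - lgamma(n + 1)
                - sum(lgamma(x + 1) for x in t.ravel()))

    observed = np.array([[3, 1, 0],
                         [1, 4, 2],
                         [0, 2, 5]])                      # invented counts

    # expand the table into one (row label, column label) pair per observation
    rows, cols = np.nonzero(np.ones_like(observed))
    row_labels = np.repeat(rows, observed.ravel())
    col_labels = np.repeat(cols, observed.ravel())

    rng = np.random.default_rng(0)
    obs_logp = log_table_prob(observed)
    hits, n_sim = 0, 20000
    for _ in range(n_sim):
        permuted = rng.permutation(col_labels)            # margins are preserved
        sim = np.zeros_like(observed)
        np.add.at(sim, (row_labels, permuted), 1)         # re-tabulate
        if log_table_prob(sim) <= obs_logp + 1e-9:
            hits += 1
    print("Monte Carlo estimate of the exact p-value:", (hits + 1) / (n_sim + 1))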
The test can also be used to quantify the ''overlap'' between two sets.
For example, in enrichment analyses in statistical genetics one set of genes may be annotated for a given phenotype and the user may be interested in testing the overlap of their own set with the annotated set.
In this case a 2 × 2 contingency table may be generated and Fisher's exact test applied by identifying
# Genes that are provided in both lists
# Genes that are provided in the first list and not the second
# Genes that are provided in the second list and not the first
# Genes that are not provided in either list
The test assumes genes in either list are taken from a broader set of genes (e.g. all remaining genes).
A ''p''-value may then be calculated, summarizing the significance of the overlap between the two lists.
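A minimal sketch of such an overlap test, using invented gene identifiers and an invented background size (the four cells follow the enumeration above):

    # Overlap test between two gene lists drawn from a common background set.
    from scipy import stats

    background = {f"gene{i}" for i in range(2000)}     # all genes considered (invented)
    annotated  = {f"gene{i}" for i in range(0, 40)}    # genes annotated for the phenotype (invented)
    my_genes   = {f"gene{i}" for i in range(20, 70)}   # the user's own list (invented)

    in_both        = len(annotated & my_genes)                 # 1. in both lists
    only_annotated = len(annotated - my_genes)                 # 2. in the first list only
    only_mine      = len(my_genes - annotated)                 # 3. in the second list only
    in_neither     = len(background - annotated - my_genes)    # 4. in neither list

    table = [[in_both, only_annotated],
             [only_mine, in_neither]]
    odds_ratio, p_value = stats.fisher_exact(table, alternative="greater")  # test for enrichment
    print(table, p_value)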
Example
For example, a sample of teenagers might be divided into male and female on one hand and those who are and are not currently studying for a statistics exam on the other. We hypothesize, for example, that the proportion of studying students is higher among the women than among the men, and we want to test whether any difference in proportions that we observe is significant.
The data might look like this:

                  Men   Women   Row total
    Studying        1       9          10
    Non-studying   11       3          14
    Column total   12      12          24
The question we ask about these data is: Knowing that 10 of these 24 teenagers are studying and that 12 of the 24 are female, and assuming the null hypothesis that men and women are equally likely to study, what is the probability that these 10 teenagers who are studying would be so unevenly distributed between the women and the men? If we were to choose 10 of the teenagers at random, what is the probability that 9 or more of them would be among the 12 women and only 1 or fewer from among the 12 men?
First example
Before we proceed with the Fisher test, we first introduce some notation. We represent the cells by the letters ''a, b, c'' and ''d'', call the totals across rows and columns ''marginal totals'', and represent the grand total by ''n''. So the table now looks like this:

                  Men      Women    Row total
    Studying       a        b        a + b
    Non-studying   c        d        c + d
    Column total   a + c    b + d    a + b + c + d (= n)
Fisher showed that conditional on the margins of the table, ''a'' is distributed as a hypergeometric distribution with ''a+c'' draws from a population with ''a+b'' successes and ''c+d'' failures. The probability of obtaining such a set of values is given by:

    p = \frac{\binom{a+b}{a} \binom{c+d}{c}}{\binom{n}{a+c}} = \frac{(a+b)!\,(c+d)!\,(a+c)!\,(b+d)!}{a!\,b!\,c!\,d!\,n!}
where \tbinom{n}{k} is the binomial coefficient and the symbol ! indicates the factorial operator.
This can be seen as follows. If the marginal totals (i.e. ''a+b'', ''c+d'', ''a+c'', and ''b+d'') are known, only a single degree of freedom is left: the value e.g. of ''a'' suffices to deduce the other values. Now, ''p'' is the probability that ''a'' elements are positive in a random selection (without replacement) of ''a+c'' elements from a larger set containing ''n'' elements in total out of which ''a+b'' are positive, which is precisely the definition of the hypergeometric distribution.
With the data above (using the first of the equivalent forms), this gives:

    p = \frac{\binom{10}{1} \binom{14}{11}}{\binom{24}{12}} = \frac{3640}{2704156} \approx 0.001346076
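The same value can be checked numerically; the following sketch (assuming SciPy is available) evaluates both the binomial-coefficient form and SciPy's hypergeometric distribution:

    # Check the point probability of the observed table in two equivalent ways.
    from math import comb
    from scipy.stats import hypergeom

    a, b, c, d = 1, 9, 11, 3
    n = a + b + c + d                                   # 24

    p_formula = comb(a + b, a) * comb(c + d, c) / comb(n, a + c)
    p_scipy = hypergeom.pmf(a, n, a + b, a + c)         # pmf(k, population, successes, draws)
    print(p_formula, p_scipy)                           # both approximately 0.001346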
Second example
The formula above gives the exact hypergeometric probability of observing this particular arrangement of the data, assuming the given marginal totals, on the null hypothesis that men and women are equally likely to be studiers. To put it another way, if we assume that the probability that a man is a studier is ''p'', the probability that a woman is a studier is also ''p'', and we assume that both men and women enter our sample independently of whether or not they are studiers, then this hypergeometric formula gives the conditional probability of observing the values ''a, b, c, d'' in the four cells, conditionally on the observed marginals (i.e., assuming the row and column totals shown in the margins of the table are given). This remains true even if men enter our sample with different probabilities than women. The requirement is merely that the two classification characteristics—gender, and studier (or not)—are not associated.
For example, suppose we knew probabilities ''p''1, ''p''2, ''p''3, ''p''4 with ''p''1 + ''p''2 + ''p''3 + ''p''4 = 1 such that (male studier, male non-studier, female studier, female non-studier) had respective probabilities (''p''1, ''p''2, ''p''3, ''p''4) for each individual encountered under our sampling procedure. Then still, were we to calculate the distribution of cell entries conditional on the given marginals, we would obtain the above formula, in which none of these probabilities occurs. Thus, we can calculate the exact probability of any arrangement of the 24 teenagers into the four cells of the table, but Fisher showed that to generate a significance level, we need consider only the cases where the marginal totals are the same as in the observed table, and among those, only the cases where the arrangement is as extreme as the observed arrangement, or more so. (Barnard's test relaxes this constraint on one set of the marginal totals.) In the example, there are 11 such cases. Of these only one is more extreme in the same direction as our data; it looks like this:

                  Men   Women   Row total
    Studying        0      10          10
    Non-studying   12       2          14
    Column total   12      12          24
For this table (with extremely unequal studying proportions) the probability is

    p = \frac{\binom{10}{0} \binom{14}{12}}{\binom{24}{12}} = \frac{91}{2704156} \approx 0.000033652
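The claim above, that conditioning on the margins removes any dependence on the underlying cell probabilities, can be illustrated by simulation; the sketch below uses arbitrary, invented probabilities under independence of the two classifications and compares the conditional distribution of ''a'' with the hypergeometric distribution:

    # Simulate samples of 24 people with gender and studying independent, keep
    # only the samples whose margins match the observed table (10 studiers,
    # 12 men), and compare the distribution of a with the hypergeometric pmf.
    import numpy as np
    from scipy.stats import hypergeom

    rng = np.random.default_rng(0)
    p_male, p_studier = 0.6, 0.3                        # arbitrary; need not equal 0.5
    cell_probs = [p_male * p_studier, p_male * (1 - p_studier),
                  (1 - p_male) * p_studier, (1 - p_male) * (1 - p_studier)]

    counts = rng.multinomial(24, cell_probs, size=500_000)
    a, c, b, d = counts.T                               # (male studier, male non-studier,
                                                        #  female studier, female non-studier)
    keep = (a + b == 10) & (a + c == 12)                # condition on the observed margins
    print("accepted samples:", keep.sum())
    for k in range(4):
        print(k, round(np.mean(a[keep] == k), 4), round(hypergeom.pmf(k, 24, 10, 12), 4))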
p-value tests
In order to calculate the significance of the observed data, i.e. the total probability of observing data as extreme or more extreme if the null hypothesis is true, we have to calculate the values of ''p'' for both these tables, and add them together. This gives a one-tailed test
, with ''p'' approximately 0.001346076 + 0.000033652 = 0.001379728. For example, in the
R statistical computing environment, this value can be obtained as
fisher.test(rbind(c(1,9),c(11,3)), alternative="less")$p.value
, or in Python, using
scipy.stats.fisher_exact([[1, 9], [11, 3]], alternative="less")
(where one receives both the prior odds ratio and the ''p''-value). This value can be interpreted as the sum of evidence provided by the observed data—or any more extreme table—for the null hypothesis
(that there is no difference in the proportions of studiers between men and women). The smaller the value of ''p'', the greater the evidence for rejecting the null hypothesis; so here the evidence is strong that men and women are not equally likely to be studiers.
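The one-tailed value can also be reproduced directly as a hypergeometric tail sum; the following sketch (assuming SciPy) adds the probabilities of the two qualifying tables (''a'' = 1 and ''a'' = 0) and compares the result with the library call quoted above:

    # One-tailed p-value: sum the probabilities of the tables at least as extreme
    # in this direction (one or zero male studiers).
    from scipy import stats
    from scipy.stats import hypergeom

    p_one_sided = sum(hypergeom.pmf(k, 24, 10, 12) for k in (0, 1))
    # equivalently: hypergeom.cdf(1, 24, 10, 12)

    odds_ratio, p_library = stats.fisher_exact([[1, 9], [11, 3]], alternative="less")
    print(p_one_sided, p_library)                       # both approximately 0.0013797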
For a two-tailed test
we must also consider tables that are equally extreme, but in the opposite direction. Unfortunately, classification of the tables according to whether or not they are 'as extreme' is problematic. An approach used by the fisher.test function in R is to compute the ''p''-value by summing the probabilities for all tables with probabilities less than or equal to that of the observed table. In the example here, the 2-sided ''p''-value is twice the 1-sided value—but in general these can differ substantially for tables with small counts, unlike the case with test statistics that have a symmetric sampling distribution.
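A sketch of this rule for the running example (with the margins fixed, ''a'' can range from 0 to 10, and every table whose probability does not exceed that of the observed table contributes):

    # Two-sided p-value in the style described above: sum the probabilities of all
    # tables (with the observed margins) that are no more probable than the
    # observed table (a = 1).
    from scipy import stats
    from scipy.stats import hypergeom

    p_obs = hypergeom.pmf(1, 24, 10, 12)
    p_two_sided = sum(p for p in (hypergeom.pmf(k, 24, 10, 12) for k in range(11))
                      if p <= p_obs * (1 + 1e-7))       # tolerance for rounding error
    print(p_two_sided)                                  # about twice the one-sided value here
    print(stats.fisher_exact([[1, 9], [11, 3]])[1])     # SciPy's default two-sided p-value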
Controversies
Fisher's test gives exact ''p''-values, but some authors have argued that it is conservative, i.e. that its actual rejection rate is below the nominal significance level.
The apparent contradiction stems from the combination of a discrete statistic with fixed significance levels.
Consider the following proposal for a significance test at the 5%-level: reject the null hypothesis for each table to which Fisher's test assigns a ''p''-value equal to or smaller than 5%. Because the set of all tables is discrete, there may not be a table for which equality is achieved. If ''α''e is the largest ''p''-value smaller than 5% which can actually occur for some table, then the proposed test effectively tests at the ''α''e-level. For small sample sizes, ''α''e might be significantly lower than 5%.
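A sketch of this effect, using the margins of the example above (10 studiers, 12 men, 24 in total): the attainable one-sided ''p''-values form a short discrete list, and the code below picks out the largest one below 5%.

    # Enumerate the attainable one-sided (lower-tail) p-values for fixed margins
    # and find the largest one that is still below the nominal 5% level.
    from scipy.stats import hypergeom

    attainable = sorted(hypergeom.cdf(k, 24, 10, 12) for k in range(11))
    alpha_e = max(p for p in attainable if p < 0.05)
    print([round(p, 5) for p in attainable])
    print("effective level:", round(alpha_e, 5))        # noticeably smaller than 0.05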
While this effect occurs for any discrete statistic (not just in contingency tables, or for Fisher's test), it has been argued that the problem is compounded by the fact that Fisher's test conditions on the marginals. To avoid the problem, many authors discourage the use of fixed significance levels when dealing with discrete problems.
The decision to condition on the margins of the table is also controversial. The ''p''-values derived from Fisher's test come from the distribution that conditions on the margin totals. In this sense, the test is exact only for the conditional distribution and not the original table where the margin totals may change from experiment to experiment. It is possible to obtain an exact ''p''-value for the 2×2 table when the margins are not held fixed.
Barnard's test, for example, allows for random margins. However, some authors (including, later, Barnard himself) have criticized Barnard's test based on this property. They argue that the marginal success total is an (almost) ancillary statistic, containing (almost) no information about the tested property.
The act of conditioning on the marginal success rate from a 2×2 table can be shown to ignore some information in the data about the unknown odds ratio. The argument that the marginal totals are (almost) ancillary implies that the appropriate likelihood function for making inferences about this odds ratio should be conditioned on the marginal success rate.
Whether this lost information is important for inferential purposes is the essence of the controversy.
Alternatives
An alternative exact test, Barnard's exact test, has been developed, and its proponents suggest that this method is more powerful, particularly in 2×2 tables. Furthermore, Boschloo's test is an exact test that is uniformly more powerful than Fisher's exact test by construction.
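Recent versions of SciPy include implementations of both alternatives; a sketch comparing their two-sided ''p''-values with Fisher's test on the table from the example above:

    # Compare Fisher's, Barnard's and Boschloo's exact tests on the same table.
    from scipy import stats

    table = [[1, 9], [11, 3]]
    print("Fisher:  ", stats.fisher_exact(table)[1])
    print("Barnard: ", stats.barnard_exact(table).pvalue)
    print("Boschloo:", stats.boschloo_exact(table).pvalue)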
Most modern statistical packages will calculate the significance of Fisher tests, in some cases even where the chi-squared approximation would also be acceptable. The actual computations as performed by statistical software packages will as a rule differ from those described above, because numerical difficulties may result from the large values taken by the factorials. A simple, somewhat better computational approach relies on a gamma function or log-gamma function, but methods for accurate computation of hypergeometric and binomial probabilities remain an active research area.
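A sketch of the log-gamma approach mentioned above (using log ''x''! = lgamma(''x'' + 1), so the point probability is computed without forming the large factorials explicitly):

    # Compute the hypergeometric point probability on the log scale to avoid
    # overflow from large factorials.
    from math import lgamma, exp

    def log_factorial(x):
        return lgamma(x + 1)

    def fisher_point_prob(a, b, c, d):
        n = a + b + c + d
        logp = (log_factorial(a + b) + log_factorial(c + d)
                + log_factorial(a + c) + log_factorial(b + d)
                - log_factorial(a) - log_factorial(b)
                - log_factorial(c) - log_factorial(d) - log_factorial(n))
        return exp(logp)

    print(fisher_point_prob(1, 9, 11, 3))               # approximately 0.001346
    print(fisher_point_prob(1000, 900, 1100, 300))      # large counts, no overflow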
For stratified categorical data, the Cochran–Mantel–Haenszel test must be used instead of Fisher's test.
Choi et al. propose a ''p''-value derived from the likelihood ratio test based on the conditional distribution of the odds ratio given the marginal success rate. This ''p''-value is inferentially consistent with classical tests of normally distributed data as well as with likelihood ratios and support intervals based on this conditional likelihood function. It is also readily computable.
[See also Likelihood Ratio Statistics for 2 x 2 Tables (online calculator).]
See also
* Bernoulli trial
* Boschloo's test
References
External links
Calculate Fisher's Exact Test Online
Likelihood Ratio Statistics for 2X2 Tables