HOME

TheInfoList



OR:

Multiple comparisons, multiplicity or multiple testing problem occurs in
statistics Statistics (from German language, German: ', "description of a State (polity), state, a country") is the discipline that concerns the collection, organization, analysis, interpretation, and presentation of data. In applying statistics to a s ...
when one considers a set of
statistical inference Statistical inference is the process of using data analysis to infer properties of an underlying probability distribution.Upton, G., Cook, I. (2008) ''Oxford Dictionary of Statistics'', OUP. . Inferential statistical analysis infers properties of ...
s simultaneously or
estimates In the Westminster system of government, the ''Estimates'' are an outline of government spending for the following fiscal year presented by the Cabinet (government), cabinet to parliament. The Estimates are drawn up by bureaucrats in the finance ...
a subset of parameters selected based on the observed values. The larger the number of inferences made, the more likely erroneous inferences become. Several statistical techniques have been developed to address this problem, for example, by requiring a stricter significance threshold for individual comparisons, so as to compensate for the number of inferences being made. Methods for
family-wise error rate In statistics, family-wise error rate (FWER) is the probability of making one or more false discoveries, or type I errors when performing multiple hypotheses tests. Familywise and experimentwise error rates John Tukey developed in 1953 the conce ...
give the probability of false positives resulting from the multiple comparisons problem.


History

The problem of multiple comparisons received increased attention in the 1950s with the work of statisticians such as Tukey and Scheffé. Over the ensuing decades, many procedures were developed to address the problem. In 1996, the first international conference on multiple comparison procedures took place in
Tel Aviv Tel Aviv-Yafo ( or , ; ), sometimes rendered as Tel Aviv-Jaffa, and usually referred to as just Tel Aviv, is the most populous city in the Gush Dan metropolitan area of Israel. Located on the Israeli Mediterranean coastline and with a popula ...
. This is an active research area with work being done by, for example
Emmanuel Candès Emmanuel Jean Candès (born 27 April 1970) is a French statistician most well known for his contributions to the field of compressed sensing and statistical hypothesis testing. He is a professor of statistics and electrical engineering (by court ...
and
Vladimir Vovk Vladimir Vovk is a British computer scientist, and professor at Royal Holloway University of London. He is the co-inventor of Conformal prediction and is known for his contributions to the concept of E-values. He is the co-director of the Cent ...
.


Definition

Multiple comparisons arise when a statistical analysis involves multiple simultaneous statistical tests, each of which has a potential to produce a "discovery". A stated confidence level generally applies only to each test considered individually, but often it is desirable to have a confidence level for the whole family of simultaneous tests. Failure to compensate for multiple comparisons can have important real-world consequences, as illustrated by the following examples: * Suppose the treatment is a new way of teaching writing to students, and the control is the standard way of teaching writing. Students in the two groups can be compared in terms of grammar, spelling, organization, content, and so on. As more attributes are compared, it becomes increasingly likely that the treatment and control groups will appear to differ on at least one attribute due to random
sampling error In statistics, sampling errors are incurred when the statistical characteristics of a population are estimated from a subset, or sample, of that population. Since the sample does not include all members of the population, statistics of the sample ...
alone. * Suppose we consider the efficacy of a
drug A drug is any chemical substance other than a nutrient or an essential dietary ingredient, which, when administered to a living organism, produces a biological effect. Consumption of drugs can be via insufflation (medicine), inhalation, drug i ...
in terms of the reduction of any one of a number of disease symptoms. As more symptoms are considered, it becomes increasingly likely that the drug will appear to be an improvement over existing drugs in terms of at least one symptom. In both examples, as the number of comparisons increases, it becomes more likely that the groups being compared will appear to differ in terms of at least one attribute. Our confidence that a result will generalize to independent data should generally be weaker if it is observed as part of an analysis that involves multiple comparisons, rather than an analysis that involves only a single comparison. For example, if one test is performed at the 5% level and the corresponding null hypothesis is true, there is only a 5% risk of incorrectly rejecting the null hypothesis. However, if 100 tests are each conducted at the 5% level and all corresponding null hypotheses are true, the
expected number In probability theory, the expected value (also called expectation, expectancy, expectation operator, mathematical expectation, mean, expectation value, or first moment) is a generalization of the weighted average. Informally, the expected val ...
of incorrect rejections (also known as
false positive A false positive is an error in binary classification in which a test result incorrectly indicates the presence of a condition (such as a disease when the disease is not present), while a false negative is the opposite error, where the test resu ...
s or
Type I error Type I error, or a false positive, is the erroneous rejection of a true null hypothesis in statistical hypothesis testing. A type II error, or a false negative, is the erroneous failure in bringing about appropriate rejection of a false null hy ...
s) is 5. If the tests are statistically independent from each other (i.e. are performed on independent samples), the probability of at least one incorrect rejection is approximately 99.4%. The multiple comparisons problem also applies to confidence intervals. A single confidence interval with a 95%
coverage probability In statistical estimation theory, the coverage probability, or coverage for short, is the probability that a confidence interval or confidence region will include the true value (parameter) of interest. It can be defined as the proportion of i ...
level will contain the true value of the parameter in 95% of samples. However, if one considers 100 confidence intervals simultaneously, each with 95% coverage probability, the expected number of non-covering intervals is 5. If the intervals are statistically independent from each other, the probability that at least one interval does not contain the population parameter is 99.4%. Techniques have been developed to prevent the inflation of false positive rates and non-coverage rates that occur with multiple statistical tests.


Classification of multiple hypothesis tests


Controlling procedures


Multiple testing correction

Multiple testing correction refers to making statistical tests more stringent in order to counteract the problem of multiple testing. The best known such adjustment is the
Bonferroni correction In statistics, the Bonferroni correction is a method to counteract the multiple comparisons problem. Background The method is named for its use of the Bonferroni inequalities. Application of the method to confidence intervals was described by ...
, but other methods have been developed. Such methods are typically designed to control the
family-wise error rate In statistics, family-wise error rate (FWER) is the probability of making one or more false discoveries, or type I errors when performing multiple hypotheses tests. Familywise and experimentwise error rates John Tukey developed in 1953 the conce ...
or the
false discovery rate In statistics, the false discovery rate (FDR) is a method of conceptualizing the rate of type I errors in null hypothesis testing when conducting multiple comparisons. FDR-controlling procedures are designed to control the FDR, which is the exp ...
. If ''m'' independent comparisons are performed, the ''
family-wise error rate In statistics, family-wise error rate (FWER) is the probability of making one or more false discoveries, or type I errors when performing multiple hypotheses tests. Familywise and experimentwise error rates John Tukey developed in 1953 the conce ...
'' (FWER), is given by : \bar = 1-\left( 1-\alpha_ \right)^m. Hence, unless the tests are perfectly positively dependent (i.e., identical), \bar increases as the number of comparisons increases. If we do not assume that the comparisons are independent, then we can still say: : \bar \le m \cdot \alpha_, which follows from
Boole's inequality In probability theory, Boole's inequality, also known as the union bound, says that for any finite or countable set of events, the probability that at least one of the events happens is no greater than the sum of the probabilities of the indiv ...
. Example: 0.2649=1-(1-.05)^6 \le .05 \times 6 = 0.3 There are different ways to assure that the family-wise error rate is at most \alpha. The most conservative method, which is free of dependence and distributional assumptions, is the
Bonferroni correction In statistics, the Bonferroni correction is a method to counteract the multiple comparisons problem. Background The method is named for its use of the Bonferroni inequalities. Application of the method to confidence intervals was described by ...
\alpha_\mathrm=/m. A marginally less conservative correction can be obtained by solving the equation for the family-wise error rate of m independent comparisons for \alpha_\mathrm. This yields \alpha_ = 1-^, which is known as the
Šidák correction In statistics, the Šidák correction, or Dunn–Šidák correction, is a method used to counteract the problem of multiple comparisons. It is a simple method to control the family-wise error rate. When all null hypotheses are true, the method ...
. Another procedure is the
Holm–Bonferroni method In statistics, the Holm–Bonferroni method, also called the Holm method or Bonferroni–Holm method, is used to counteract the problem of multiple comparisons. It is intended to control the family-wise error rate (FWER) and offers a simple tes ...
, which uniformly delivers more power than the simple Bonferroni correction, by testing only the lowest p-value (i=1) against the strictest criterion, and the higher p-values (i>1) against progressively less strict criteria. \alpha_\mathrm=/(m-i+1). For continuous problems, one can employ
Bayesian Thomas Bayes ( ; c. 1701 – 1761) was an English statistician, philosopher, and Presbyterian minister. Bayesian ( or ) may be either any of a range of concepts and approaches that relate to statistical methods based on Bayes' theorem Bayes ...
logic to compute m from the prior-to-posterior volume ratio. Continuous generalizations of the Bonferroni and
Šidák correction In statistics, the Šidák correction, or Dunn–Šidák correction, is a method used to counteract the problem of multiple comparisons. It is a simple method to control the family-wise error rate. When all null hypotheses are true, the method ...
are presented in.


Large-scale multiple testing

Traditional methods for multiple comparisons adjustments focus on correcting for modest numbers of comparisons, often in an
analysis of variance Analysis of variance (ANOVA) is a family of statistical methods used to compare the Mean, means of two or more groups by analyzing variance. Specifically, ANOVA compares the amount of variation ''between'' the group means to the amount of variati ...
. A different set of techniques have been developed for "large-scale multiple testing", in which thousands or even greater numbers of tests are performed. For example, in
genomics Genomics is an interdisciplinary field of molecular biology focusing on the structure, function, evolution, mapping, and editing of genomes. A genome is an organism's complete set of DNA, including all of its genes as well as its hierarchical, ...
, when using technologies such as
microarray A microarray is a multiplex (assay), multiplex lab-on-a-chip. Its purpose is to simultaneously detect the expression of thousands of biological interactions. It is a two-dimensional array on a Substrate (materials science), solid substrate—usu ...
s, expression levels of tens of thousands of genes can be measured, and genotypes for millions of genetic markers can be measured. Particularly in the field of
genetic association Genetic association is when one or more genotypes within a population co-occur with a phenotype, phenotypic trait association (statistics), more often than would be expected by chance occurrence. Studies of genetic association aim to test whether ...
studies, there has been a serious problem with non-replication — a result being strongly statistically significant in one study but failing to be replicated in a follow-up study. Such non-replication can have many causes, but it is widely considered that failure to fully account for the consequences of making multiple comparisons is one of the causes. It has been argued that advances in
measurement Measurement is the quantification of attributes of an object or event, which can be used to compare with other objects or events. In other words, measurement is a process of determining how large or small a physical quantity is as compared to ...
and
information technology Information technology (IT) is a set of related fields within information and communications technology (ICT), that encompass computer systems, software, programming languages, data processing, data and information processing, and storage. Inf ...
have made it far easier to generate large datasets for
exploratory analysis In statistics, exploratory data analysis (EDA) is an approach of analyzing data sets to summarize their main characteristics, often using statistical graphics and other data visualization methods. A statistical model can be used or not, but prima ...
, often leading to the testing of large numbers of hypotheses with no prior basis for expecting many of the hypotheses to be true. In this situation, very high false positive rates are expected unless multiple comparisons adjustments are made. For large-scale testing problems where the goal is to provide definitive results, the
family-wise error rate In statistics, family-wise error rate (FWER) is the probability of making one or more false discoveries, or type I errors when performing multiple hypotheses tests. Familywise and experimentwise error rates John Tukey developed in 1953 the conce ...
remains the most accepted parameter for ascribing significance levels to statistical tests. Alternatively, if a study is viewed as exploratory, or if significant results can be easily re-tested in an independent study, control of the
false discovery rate In statistics, the false discovery rate (FDR) is a method of conceptualizing the rate of type I errors in null hypothesis testing when conducting multiple comparisons. FDR-controlling procedures are designed to control the FDR, which is the exp ...
(FDR) is often preferred. The FDR, loosely defined as the expected proportion of false positives among all significant tests, allows researchers to identify a set of "candidate positives" that can be more rigorously evaluated in a follow-up study. The practice of trying many unadjusted comparisons in the hope of finding a significant one is a known problem, whether applied unintentionally or deliberately, is sometimes called "
p-hacking Data dredging, also known as data snooping or ''p''-hacking is the misuse of data analysis to find patterns in data that can be presented as statistically significant, thus dramatically increasing and understating the risk of false positives. Thi ...
".


Assessing whether any alternative hypotheses are true

A basic question faced at the outset of analyzing a large set of testing results is whether there is evidence that any of the alternative hypotheses are true. One simple meta-test that can be applied when it is assumed that the tests are independent of each other is to use the
Poisson distribution In probability theory and statistics, the Poisson distribution () is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time if these events occur with a known const ...
as a model for the number of significant results at a given level α that would be found when all null hypotheses are true. If the observed number of positives is substantially greater than what should be expected, this suggests that there are likely to be some true positives among the significant results. For example, if 1000 independent tests are performed, each at level α = 0.05, we expect 0.05 × 1000 = 50 significant tests to occur when all null hypotheses are true. Based on the Poisson distribution with mean 50, the probability of observing more than 61 significant tests is less than 0.05, so if more than 61 significant results are observed, it is very likely that some of them correspond to situations where the alternative hypothesis holds. A drawback of this approach is that it overstates the evidence that some of the alternative hypotheses are true when the
test statistic Test statistic is a quantity derived from the sample for statistical hypothesis testing.Berger, R. L.; Casella, G. (2001). ''Statistical Inference'', Duxbury Press, Second Edition (p.374) A hypothesis test is typically specified in terms of a tes ...
s are positively correlated, which commonly occurs in practice. . On the other hand, the approach remains valid even in the presence of correlation among the test statistics, as long as the Poisson distribution can be shown to provide a good approximation for the number of significant results. This scenario arises, for instance, when mining significant frequent itemsets from transactional datasets. Furthermore, a careful two stage analysis can bound the FDR at a pre-specified level. Another common approach that can be used in situations where the
test statistic Test statistic is a quantity derived from the sample for statistical hypothesis testing.Berger, R. L.; Casella, G. (2001). ''Statistical Inference'', Duxbury Press, Second Edition (p.374) A hypothesis test is typically specified in terms of a tes ...
s can be standardized to
Z-scores In statistics, the standard score or ''z''-score is the number of standard deviations by which the value of a raw score (i.e., an observed value or data point) is above or below the mean value of what is being observed or measured. Raw scores ...
is to make a normal quantile plot of the test statistics. If the observed quantiles are markedly more dispersed than the normal quantiles, this suggests that some of the significant results may be true positives.


See also

* ''q''-value ;Key concepts *
Family-wise error rate In statistics, family-wise error rate (FWER) is the probability of making one or more false discoveries, or type I errors when performing multiple hypotheses tests. Familywise and experimentwise error rates John Tukey developed in 1953 the conce ...
* False positive rate *
False discovery rate In statistics, the false discovery rate (FDR) is a method of conceptualizing the rate of type I errors in null hypothesis testing when conducting multiple comparisons. FDR-controlling procedures are designed to control the FDR, which is the exp ...
(FDR) *
False coverage rate In statistics, a false coverage rate (FCR) is the average rate of false coverage, i.e. not covering the true parameters, among the selected intervals. The FCR gives a simultaneous coverage at a (1 − ''α'')×100% level for al ...
(FCR) *
Interval estimation In statistics, interval estimation is the use of sample data to estimate an '' interval'' of possible values of a parameter of interest. This is in contrast to point estimation, which gives a single value. The most prevalent forms of interval es ...
*
Post-hoc analysis ''Post hoc'' (sometimes written as ''post-hoc'') is a Latin phrase, meaning "after this" or "after the event". ''Post hoc'' may refer to: *Post hoc analysis, ''Post hoc'' analysis or ''post hoc'' test, statistical analyses that were not specified ...
* Experimentwise error rate *
Statistical hypothesis testing A statistical hypothesis test is a method of statistical inference used to decide whether the data provide sufficient evidence to reject a particular hypothesis. A statistical hypothesis test typically involves a calculation of a test statistic. T ...
;General methods of alpha adjustment for multiple comparisons *
Closed testing procedure In statistics, the closed testing procedure is a general method for performing more than one hypothesis test simultaneously. The closed testing principle Suppose there are ''k'' hypotheses ''H''1,..., ''H'k'' to be tested and the overall type ...
*
Bonferroni correction In statistics, the Bonferroni correction is a method to counteract the multiple comparisons problem. Background The method is named for its use of the Bonferroni inequalities. Application of the method to confidence intervals was described by ...
*Boole– Bonferroni bound * Duncan's new multiple range test *
Holm–Bonferroni method In statistics, the Holm–Bonferroni method, also called the Holm method or Bonferroni–Holm method, is used to counteract the problem of multiple comparisons. It is intended to control the family-wise error rate (FWER) and offers a simple tes ...
*
Harmonic mean p-value The harmonic mean ''p''-value (HMP) is a statistical technique for addressing the multiple comparisons problem that controls the strong-sense family-wise error rate (this claim has been disputed). It improves on the power of Bonferroni correction ...
procedure * Benjamini–Hochberg procedure *
E-values In statistical hypothesis testing, e-values quantify the evidence in the data against a null hypothesis (e.g., "the coin is fair", or, in a medical context, "this new treatment has no effect"). They serve as a more robust alternative to p-values, ...
;Related concepts *
Testing hypotheses suggested by the data In statistics, hypotheses suggested by a given dataset, when tested with the same dataset that suggested them, are likely to be accepted even when they are not true. This is because circular reasoning (double dipping) would be involved: somethi ...
*
Texas sharpshooter fallacy The Texas sharpshooter fallacy is an informal fallacy which is committed when differences in data are ignored, but similarities are overemphasized. From this reasoning, a false conclusion is inferred. This fallacy is the philosophical or rhetorical ...
*
Model selection Model selection is the task of selecting a model from among various candidates on the basis of performance criterion to choose the best one. In the context of machine learning and more generally statistical analysis, this may be the selection of ...
*
Look-elsewhere effect The look-elsewhere effect is a phenomenon in the statistical analysis of scientific experiments where an apparently statistically significant observation may have actually arisen by chance because of the sheer size of the parameter space to be sear ...
*
Data dredging Data dredging, also known as data snooping or ''p''-hacking is the misuse of data analysis to find patterns in data that can be presented as statistically significant, thus dramatically increasing and understating the risk of false positives. Th ...


References


Further reading

* F. Bretz, T. Hothorn, P. Westfall (2010), ''Multiple Comparisons Using R'', CRC Press * S. Dudoit and M. J. van der Laan (2008), ''Multiple Testing Procedures with Application to Genomics'', Springer * * * P. H. Westfall and S. S. Young (1993), ''Resampling-based Multiple Testing: Examples and Methods for p-Value Adjustment'', Wiley * P. Westfall, R. Tobias, R. Wolfinger (2011) ''Multiple comparisons and multiple testing using SAS'', 2nd edn, SAS Institute
A gallery of examples of implausible correlations sourced by data dredging


An ''
xkcd ''xkcd'' is a serial webcomic created in 2005 by American author Randall Munroe. Sometimes styled ''XKCD'', the comic's tagline describes it as "a webcomic of romance, sarcasm, math, and language". Munroe states on the comic's website that the ...
'' comic about the multiple comparisons problem, using jelly beans and acne as an example {{Statistics