The replication crisis, also known as the reproducibility or replicability crisis, refers to the growing number of published scientific results that other researchers have been unable to reproduce or verify. Because the reproducibility of empirical results is an essential part of the scientific method, such failures undermine the credibility of theories that build on them and can call into question substantial parts of scientific knowledge.
The replication crisis is frequently discussed in relation to psychology and medicine, wherein considerable efforts have been undertaken to reinvestigate the results of classic studies to determine whether they are reliable, and if they turn out not to be, the reasons for the failure. Data strongly indicate that other natural and social sciences are also affected.
The phrase "replication crisis" was coined in the early 2010s as part of a growing awareness of the problem. Considerations of causes and remedies have given rise to a new scientific discipline known as metascience, which uses methods of empirical research to examine empirical research practice.
Considerations about reproducibility can be placed into two categories. ''Reproducibility'', in the narrow sense, refers to reexamining and validating the analysis of a given set of data. The second category, ''replication'', involves repeating an existing experiment or study with new, independent data to verify the original conclusions.
Background
Replication
Replication has been called "the cornerstone of science". Environmental health scientist Stefan Schmidt began a 2009 review with this description of replication:
But there is limited consensus on how to define ''replication'' and potentially related concepts.
A number of types of replication have been identified:
#''Direct'' or ''exact replication'', where an experimental procedure is repeated as closely as possible.
#''Systematic replication'', where an experimental procedure is largely repeated, with some intentional changes.
#''Conceptual replication'', where a finding or hypothesis is tested using a different procedure.
Conceptual replication allows testing for generalizability and veracity of a result or hypothesis.
''Reproducibility'' can also be distinguished from ''replication'', as referring to reproducing the same results using the same data set. Reproducibility of this type is why many researchers make their data available to others for testing.
The replication crisis does not necessarily mean these fields are unscientific. Rather, failed replications are part of the scientific process, in which old ideas or those that cannot withstand careful scrutiny are pruned, although this pruning is not always effective.
A hypothesis is generally considered to be supported when the results match the predicted pattern and that pattern of results is found to be statistically significant. Results are considered significant whenever the relative frequency of the observed pattern falls below an arbitrarily chosen value (i.e. the significance level) when assuming the null hypothesis is true. This generally answers the question of how unlikely results would be if no difference existed at the level of the statistical population. If the test statistic exceeds the chosen critical value, the results are considered statistically significant.
The corresponding probability of exceeding the critical value is depicted as ''p'' < 0.05, where ''p'' (typically referred to as the "''p''-value") is the probability level. When the null hypothesis is true and the studies meet all of the statistical assumptions, this threshold should produce a false positive (an incorrect hypothesis being erroneously found correct) in 5% of tests. Some fields use smaller thresholds, such as ''p'' < 0.01 (1% chance of a false positive) or ''p'' < 0.001 (0.1% chance of a false positive). But a smaller chance of a false positive often requires greater sample sizes or a greater chance of a false negative (a correct hypothesis being erroneously found incorrect). Although ''p''-value testing is the most commonly used method, it is not the only method.
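The link between the significance threshold and the false positive rate can be checked with a small simulation. The sketch below is illustrative only and rests on assumptions not in the text: every simulated study draws from a standard normal population in which the null hypothesis (mean 0) is true, the standard deviation is treated as known, and a two-sided z-test is used. The function names `z_test_p_value` and `false_positive_rate` are hypothetical.

```python
import random
from statistics import NormalDist, mean


def z_test_p_value(sample, sigma=1.0):
    """Two-sided p-value for H0: population mean is 0, with known sigma."""
    n = len(sample)
    z = mean(sample) / (sigma / n ** 0.5)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def false_positive_rate(alpha, n_studies=10000, n_per_study=20, seed=0):
    """Fraction of simulated studies that reject H0 although H0 is true."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_studies):
        # Assumption: the null hypothesis holds, so the true mean is 0.
        sample = [rng.gauss(0.0, 1.0) for _ in range(n_per_study)]
        if z_test_p_value(sample) < alpha:
            rejections += 1
    return rejections / n_studies


if __name__ == "__main__":
    for alpha in (0.05, 0.01):
        print(alpha, false_positive_rate(alpha))
```

Under these assumptions the printed rejection rates should land close to the chosen thresholds, which is what the "chance of a false positive" figures above describe.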
Statistics
Certain terms commonly used in discussions of the replication crisis have technically precise meanings, which are presented here.
In the most common case, null hypothesis testing, there are two hypotheses, a null hypothesis $H_0$ and an alternative hypothesis $H_1$. The null hypothesis is typically of the form "X and Y are statistically independent". For example, the null hypothesis might be "taking drug X does ''not'' change 1-year recovery rate from disease Y", and the alternative hypothesis is that it does change.
As testing for full statistical independence is difficult, the full null hypothesis is often reduced to a ''simplified'' null hypothesis "the effect size is 0", where "effect size" is a real number that is 0 if the ''full'' null hypothesis is true, and the larger the effect size, the more strongly the null hypothesis is violated. For example, if X is binary, then the effect size might be defined as the change in the expectation of Y upon a change of X:
$$E[Y \mid X = 1] - E[Y \mid X = 0]$$
Note that the effect size as defined above might be zero even if X and Y are not independent, such as when $Y \sim \mathcal{N}(0, 1 + X)$, where the mean of Y does not depend on X but its variance does. Since different definitions of "effect size" capture different ways for X and Y to be dependent, there are many different definitions of effect size.
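A short simulation can make the point about a zero effect size without independence concrete. The case below is an assumption chosen for illustration, not taken from a cited study: X is binary and Y is normally distributed with mean 0 and variance 1 + X, so the difference in means is zero while the spread of Y clearly depends on X. The function name `simulate_dependent_zero_effect` is hypothetical.

```python
import random
from statistics import mean, pvariance


def simulate_dependent_zero_effect(n=20000, seed=1):
    """Return (mean difference, variance of Y at X=0, variance of Y at X=1)."""
    rng = random.Random(seed)
    # Assumed model: Y ~ Normal(mean 0, variance 1 + X) for binary X.
    y_x0 = [rng.gauss(0.0, 1.0) for _ in range(n)]         # X = 0, variance 1
    y_x1 = [rng.gauss(0.0, 2.0 ** 0.5) for _ in range(n)]  # X = 1, variance 2
    effect = mean(y_x1) - mean(y_x0)
    return effect, pvariance(y_x0), pvariance(y_x1)


if __name__ == "__main__":
    effect, var_x0, var_x1 = simulate_dependent_zero_effect()
    print("estimated mean difference:", effect)
    print("variance of Y at X=0:", var_x0)
    print("variance of Y at X=1:", var_x1)
```

The estimated mean difference should be near zero, while the two variances differ markedly. An effect size based on means misses this dependence, whereas one based on variances would detect it.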
In practice, effect sizes cannot be directly observed, but must be measured by statistical estimators. For example, the above definition of effect size is often measured by Cohen's d estimator. The same effect size might have multiple estimators, as they have tradeoffs between efficiency, bias, variance, etc. This further increases the number of possible statistical quantities that can be computed on a single dataset. When an estimator for an effect size is used for statistical testing, it is called a test statistic.
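As a minimal sketch of such an estimator, the following computes Cohen's d for two independent groups using the pooled sample standard deviation. This is one common form of the estimator. The function name `cohens_d` and the example data are hypothetical.

```python
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d: difference in sample means divided by the pooled SD."""
    n_a, n_b = len(group_a), len(group_b)
    # Pooled variance weights each group's sample variance by its
    # degrees of freedom (n - 1).
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / pooled_var ** 0.5


if __name__ == "__main__":
    # Made-up data: both groups have sample variance 4, means 4 and 3.
    print(cohens_d([2, 4, 6], [1, 3, 5]))  # 0.5
```

Dividing by the pooled standard deviation expresses the mean difference in units of within-group spread, which is what makes the value comparable across studies that use different measurement scales.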

A null hypothesis test is a decision procedure which takes in some data, and outputs either $H_0$ or $H_1$. If it outputs $H_1$, it is usually stated as "there is a statistically significant effect" or "the null hypothesis is rejected".
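The idea of a test as a decision procedure can be sketched in code. The example below uses a two-sided permutation test on the difference in group means, a technique chosen here for illustration because it needs only the standard library. It is not the specific test discussed in this article. The function name `permutation_decision`, its parameters, and the sample data are all hypothetical.

```python
import random


def permutation_decision(group_a, group_b, alpha=0.05,
                         n_permutations=5000, seed=0):
    """Take in data and output "H0" or "H1" (two-sided permutation test)."""
    rng = random.Random(seed)

    def mean_gap(xs, ys):
        return abs(sum(xs) / len(xs) - sum(ys) / len(ys))

    observed = mean_gap(group_a, group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Under H0 the group labels are exchangeable, so shuffle them.
        rng.shuffle(pooled)
        if mean_gap(pooled[:n_a], pooled[n_a:]) >= observed - 1e-12:
            at_least_as_extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    p_value = (at_least_as_extreme + 1) / (n_permutations + 1)
    return "H1" if p_value < alpha else "H0"


if __name__ == "__main__":
    treated = [10.1, 10.4, 9.8, 10.3, 10.0, 10.2, 9.9, 10.5]
    control = [0.1, -0.2, 0.3, 0.0, 0.2, -0.1, 0.4, -0.3]
    print(permutation_decision(treated, control))
```

Whatever test is used, the output has the same shape: the data go in and a single binary decision between the two hypotheses comes out.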
Often, the statistical test is a (one-sided) threshold test, which is structured as follows:
# Gather data.
# Compute a test statistic