The replication crisis, also known as the reproducibility or replicability crisis, refers to the growing number of published scientific results that other researchers have been unable to reproduce or verify. Because the reproducibility of empirical results is an essential part of the scientific method, such failures undermine the credibility of theories that build on them and can call into question substantial parts of scientific knowledge.

The replication crisis is frequently discussed in relation to psychology and medicine, wherein considerable efforts have been undertaken to reinvestigate the results of classic studies to determine whether they are reliable, and if they turn out not to be, the reasons for the failure. Data strongly indicate that other natural and social sciences are also affected.

The phrase "replication crisis" was coined in the early 2010s as part of a growing awareness of the problem. Considerations of causes and remedies have given rise to a new scientific discipline known as metascience, which uses methods of empirical research to examine empirical research practice.

Considerations about reproducibility can be placed into two categories. ''Reproducibility'', in the narrow sense, refers to reexamining and validating the analysis of a given set of data. The second category, ''replication'', involves repeating an existing experiment or study with new, independent data to verify the original conclusions.


Background


Replication

Replication has been called "the cornerstone of science". Environmental health scientist Stefan Schmidt opened a 2009 review with a description of replication. But there is limited consensus on how to define ''replication'' and potentially related concepts. A number of types of replication have been identified:

# ''Direct'' or ''exact replication'', where an experimental procedure is repeated as closely as possible.
# ''Systematic replication'', where an experimental procedure is largely repeated, with some intentional changes.
# ''Conceptual replication'', where a finding or hypothesis is tested using a different procedure. Conceptual replication allows testing for generalizability and veracity of a result or hypothesis.

''Reproducibility'' can also be distinguished from ''replication'', as referring to reproducing the same results using the same data set. Reproducibility of this type is why many researchers make their data available to others for testing.

The replication crisis does not necessarily mean these fields are unscientific. Rather, it is part of the scientific process, in which old ideas or those that cannot withstand careful scrutiny are pruned, although this pruning process is not always effective.

A hypothesis is generally considered to be supported when the results match the predicted pattern and that pattern of results is found to be statistically significant. Results are considered significant whenever the relative frequency of the observed pattern falls below an arbitrarily chosen value (i.e. the significance level) when assuming the null hypothesis is true. This generally answers the question of how unlikely results would be if no difference existed at the level of the statistical population. If the test statistic exceeds the chosen critical value, the results are considered statistically significant. The corresponding probability of exceeding the critical value is depicted as ''p'' < 0.05, where ''p'' (typically referred to as the "''p''-value") is the probability level. This should result in about 5% of tests in which the null hypothesis is actually true producing false positives (an incorrect hypothesis being erroneously found correct), assuming the studies meet all of the statistical assumptions. Some fields use stricter thresholds, such as ''p'' < 0.01 (1% chance of a false positive) or ''p'' < 0.001 (0.1% chance of a false positive). But a smaller chance of a false positive often requires greater sample sizes or a greater chance of a false negative (a correct hypothesis being erroneously found incorrect). Although ''p''-value testing is the most commonly used method, it is not the only method.
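The trade-off between a stricter threshold and the false negative rate can be made concrete with a small calculation. The sketch below is illustrative only: it assumes a one-sided z-test with known unit variance, and the effect size (0.3 standard deviations) and sample size (50) are arbitrary values chosen for the example, not figures from any study discussed here.

```python
from statistics import NormalDist


def power_one_sided_z(effect, n, alpha):
    """Power of a one-sided z-test for a mean shift of `effect` SDs with n samples.

    Assumes normally distributed data with known standard deviation 1.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha)  # critical value for the chosen alpha
    # Under the alternative, the z statistic is normal with mean effect * sqrt(n).
    return 1 - std_normal.cdf(z_crit - effect * n ** 0.5)


# Hypothetical effect size and sample size, held fixed while alpha shrinks.
for alpha in (0.05, 0.01, 0.001):
    power = power_one_sided_z(effect=0.3, n=50, alpha=alpha)
    print(f"alpha={alpha}: power={power:.2f}, false negative rate={1 - power:.2f}")
```

With the sample size held fixed, lowering the threshold lowers the power, so more real effects are missed; raising the sample size is the way to recover power at a strict threshold.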


Statistics

Certain terms commonly used in discussions of the replication crisis have technically precise meanings, which are presented here.

In the most common case, null hypothesis testing, there are two hypotheses, a null hypothesis H_0 and an alternative hypothesis H_1. The null hypothesis is typically of the form "X and Y are statistically independent". For example, the null hypothesis might be "taking drug X does ''not'' change 1-year recovery rate from disease Y", and the alternative hypothesis is that it does change.

As testing for full statistical independence is difficult, the full null hypothesis is often reduced to a ''simplified'' null hypothesis "the effect size is 0", where "effect size" is a real number that is 0 if the ''full'' null hypothesis is true, and whose magnitude grows the more strongly the null hypothesis is violated. For example, if X is binary, then the effect size might be defined as the change in the expectation of Y upon a change of X:

(\text{effect size}) = \mathbb{E}[Y \mid X=1] - \mathbb{E}[Y \mid X=0]

Note that the effect size as defined above might be zero even if X and Y are not independent, such as when Y \sim \mathcal N(0, 1+X). Since different definitions of "effect size" capture different ways for X and Y to be dependent, there are many different definitions of effect size.

In practice, effect sizes cannot be directly observed, but must be measured by statistical estimators. For example, the above definition of effect size is often measured by Cohen's d estimator. The same effect size might have multiple estimators, as they have tradeoffs between efficiency, bias, variance, etc. This further increases the number of possible statistical quantities that can be computed on a single dataset. When an estimator for an effect size is used for statistical testing, it is called a test statistic.

A null hypothesis test is a decision procedure which takes in some data, and outputs either H_0 or H_1. If it outputs H_1, it is usually stated as "there is a statistically significant effect" or "the null hypothesis is rejected". Often, the statistical test is a (one-sided) threshold test, which is structured as follows:

# Gather data D.
# Compute a test statistic t[D] for the data.
# Compare the test statistic against a critical value/threshold t_{threshold}. If t[D] > t_{threshold}, then output H_1, else, output H_0.

A two-sided threshold test is similar, but with two thresholds, such that it outputs H_1 if either t[D] < t_{threshold}^- or t[D] > t_{threshold}^+.

There are 4 possible outcomes of a null hypothesis test: false negative, true negative, false positive, true positive. A false negative means that H_1 is true, but the test outcome is H_0; a true negative means that H_0 is true, and the test outcome is H_0, etc.

Significance level, false positive rate, or the alpha level, is the probability of finding the alternative to be true when the null hypothesis is true:

(\text{significance}) := \alpha := Pr(\text{find } H_1 \mid H_0)

For example, when the test is a one-sided threshold test, then \alpha = Pr_{D \sim H_0}(t[D] > t_{threshold}), where D \sim H_0 means "the data is sampled from H_0".
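The one-sided threshold test and its significance level can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a general-purpose implementation: the data are assumed normal with known standard deviation 1, the test statistic is the scaled sample mean, and the function names (`z_statistic`, `one_sided_test`, `estimate_alpha`) are hypothetical. The simulation draws data from H_0 and counts how often the test outputs H_1, which estimates \alpha.

```python
import random
from statistics import NormalDist


def z_statistic(data):
    """Test statistic t[D]: sample mean scaled by sqrt(n) (standard normal under H_0)."""
    n = len(data)
    return (sum(data) / n) * n ** 0.5


def one_sided_test(data, alpha=0.05):
    """Output 'H_1' if t[D] exceeds the threshold for the given alpha, else 'H_0'."""
    threshold = NormalDist().inv_cdf(1 - alpha)
    return "H_1" if z_statistic(data) > threshold else "H_0"


def estimate_alpha(n=20, trials=20000, alpha=0.05, seed=0):
    """Monte Carlo estimate of Pr(find H_1 | H_0): data are sampled with true mean 0."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        data = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if one_sided_test(data, alpha) == "H_1":
            rejections += 1
    return rejections / trials


estimated_alpha = estimate_alpha()
print(f"estimated false positive rate: {estimated_alpha:.3f}")
```

The estimated rate should land close to the nominal 0.05, since the threshold was chosen so that the statistic exceeds it with that probability when the null hypothesis holds.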
Statistical power, or true positive rate, is the probability of finding the alternative to be true when the alternative hypothesis is true:

(\text{power}) := 1-\beta := Pr(\text{find } H_1 \mid H_1)

where \beta is also called the false negative rate. For example, when the test is a one-sided threshold test, then 1-\beta = Pr_{D \sim H_1}(t[D] > t_{threshold}).

Given a statistical test and a data set D, the corresponding p-value is the probability that the test statistic is at least as extreme as the one observed, conditional on H_0. For example, for a one-sided threshold test, p[D] = Pr_{D' \sim H_0}(t[D'] > t[D]). If the null hypothesis is true, then the p-value is distributed uniformly on [0, 1]. Otherwise, it is typically peaked at p = 0.0 and roughly exponential, though the precise shape of the p-value distribution depends on what the alternative hypothesis is.

Since the p-value is distributed uniformly on [0, 1] conditional on the null hypothesis, one may construct a statistical test with any significance level \alpha by simply computing the p-value, then output H_1 if p[D] < \alpha. This is usually stated as "the null hypothesis is rejected at significance level \alpha", or "H_1 \; (p < \alpha)", such as "smoking is correlated with cancer (p < 0.001)".
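The claim that p-values are uniform under the null hypothesis, and pile up near zero otherwise, can be checked by simulation. The sketch below again assumes a one-sided z-test on normal data with known standard deviation 1. The alternative's true mean of 0.5 is an arbitrary illustrative choice, and the helper names are hypothetical.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def p_value(data):
    """One-sided p-value: Pr(t[D'] > t[D]) for D' ~ H_0, with known sigma = 1."""
    n = len(data)
    z = (sum(data) / n) * n ** 0.5
    return 1 - STD_NORMAL.cdf(z)


def simulate_p_values(true_mean, n=20, trials=20000, seed=1):
    """Draw `trials` data sets with the given true mean and return their p-values."""
    rng = random.Random(seed)
    return [
        p_value([rng.gauss(true_mean, 1.0) for _ in range(n)])
        for _ in range(trials)
    ]


def histogram(p_values, bins=10):
    """Fraction of p-values falling in each of `bins` equal-width bins on [0, 1]."""
    counts = [0] * bins
    for p in p_values:
        counts[min(int(p * bins), bins - 1)] += 1
    return [c / len(p_values) for c in counts]


null_hist = histogram(simulate_p_values(true_mean=0.0))  # H_0 is true
alt_hist = histogram(simulate_p_values(true_mean=0.5))   # H_0 is false
print("under H_0:", [round(f, 2) for f in null_hist])
print("under H_1:", [round(f, 2) for f in alt_hist])
```

Under H_0 each bin should hold roughly a tenth of the p-values, while under the alternative most of the mass falls in the lowest bin. That is why rejecting when p < \alpha gives a false positive rate of \alpha.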


History

The beginning of the replication crisis can be traced to a number of events in the early 2010s. Philosopher of science and social epistemologist Felipe Romero identified four events that can be considered precursors to the ongoing crisis:

* Controversies around social priming research: In the early 2010s, the well-known "elderly-walking" study by social psychologist John Bargh and colleagues failed to replicate in two direct replications. This experiment was part of a series of three studies that had been widely cited throughout the years, was regularly taught in university courses, and had inspired a large number of conceptual replications. Failures to replicate the study led to much controversy and a heated debate involving the original authors. Notably, many of the conceptual replications of the original studies also failed to replicate in subsequent direct replications.
* Controversies around experiments on extrasensory perception: Social psychologist Daryl Bem conducted a series of experiments supposedly providing evidence for the controversial phenomenon of extrasensory perception. Bem was heavily criticized for his study's methodology and upon reanalysis of the data, no evidence was found for the existence of extrasensory perception. The experiment also failed to replicate in subsequent direct replications. According to Romero, what the community found particularly upsetting was that many of the flawed procedures and statistical tools used in Bem's studies were part of common research practice in psychology.
* Amgen and Bayer reports on lack of replicability in biomedical research: Scientists from biotech companies Amgen and Bayer Healthcare reported alarmingly low replication rates (11–20%) of landmark findings in preclinical oncological research.
* Publication of studies on p-hacking and questionable research practices: Since the late 2000s, a number of studies in metascience showed how commonly adopted practices in many scientific fields, such as exploiting the flexibility of the process of data collection and reporting, could greatly increase the probability of false positive results. These studies suggested that a significant proportion of published literature in several scientific fields could be nonreplicable research.

This series of events generated a great deal of skepticism about the validity of existing research in light of widespread methodological flaws and failures to replicate findings. This led prominent scholars to declare a "crisis of confidence" in psychology and other fields, and the ensuing situation came to be known as the "replication crisis".

Although the beginning of the replication crisis can be traced to the early 2010s, some authors point out that concerns about replicability and research practices in the social sciences had been expressed much earlier. Romero notes that authors voiced concerns about the lack of direct replications in psychological research in the late 1960s and early 1970s. He also writes that certain studies in the 1990s were already reporting that journal editors and reviewers are generally biased against publishing replication studies.

In the social sciences, the blog Data Colada (whose three authors coined the term "p-hacking" in a 2014 paper) has been credited with contributing to the start of the replication crisis.

University of Virginia professor and cognitive psychologist Barbara A. Spellman has written that many criticisms of research practices and concerns about replicability of research are not new. She reports that between the late 1950s and the 1990s, scholars were already expressing concerns about a possible crisis of replication, a suspiciously high rate of positive findings, questionable research practices (QRPs), the effects of publication bias, issues with statistical power, and bad standards of reporting.

Spellman also identifies reasons that the reiteration of these criticisms and concerns in recent years led to a full-blown crisis and challenges to the status quo. First, technological improvements facilitated conducting and disseminating replication studies, and analyzing large swaths of literature for systemic problems. Second, the research community's increasing size and diversity made the work of established members more easily scrutinized by other community members unfamiliar with them. According to Spellman, these factors, coupled with increasingly limited resources and misaligned incentives for doing scientific work, led to a crisis in psychology and other fields.

According to Andrew Gelman, the works of Paul Meehl, Jacob Cohen, and Tversky and Kahneman in the 1960s and 1970s were early warnings of the replication crisis. In discussing the origins of the problem, Kahneman himself noted historical precedents in subliminal perception and dissonance reduction replication failures.

It had been repeatedly pointed out since 1962 that most psychological studies have low power (true positive rate), but low power persisted for 50 years, indicating a structural and persistent problem in psychological research.


Prevalence


In psychology

Several factors have combined to put psychology at the center of the conversation. Some areas of psychology once considered solid, such as social priming and ego depletion, have come under increased scrutiny due to failed replications. Much of the focus has been on social psychology, although other areas of psychology such as clinical psychology, developmental psychology, and educational research have also been implicated.

In August 2015, the first open empirical study of reproducibility in psychology was published, called The Reproducibility Project: Psychology. Coordinated by psychologist Brian Nosek, researchers redid 100 studies in psychological science from three high-ranking psychology journals (''Journal of Personality and Social Psychology'', ''Journal of Experimental Psychology: Learning, Memory, and Cognition'', and ''Psychological Science''). 97 of the original studies had significant effects, but of those 97, only 36% of the replications yielded significant findings (''p'' value below 0.05). The mean effect size in the replications was approximately half the magnitude of the effects reported in the original studies. The same paper examined the reproducibility rates and effect sizes by journal and discipline. Study replication rates were 23% for the ''Journal of Personality and Social Psychology'', 48% for ''Journal of Experimental Psychology: Learning, Memory, and Cognition'', and 38% for ''Psychological Science''. Studies in the field of cognitive psychology had a higher replication rate (50%) than studies in the field of social psychology (25%).

Of the 64% of non-replications, only 25% disproved the original result (at statistical significance). The other 49% were inconclusive, neither supporting nor contradicting the original result. This is because many replications were underpowered, with a sample 2.5 times smaller than the original.

A study published in 2018 in ''Nature Human Behaviour'' replicated 21 social and behavioral science papers from ''Nature'' and ''Science'', finding that only about 62% could successfully reproduce original results.

Similarly, in a study conducted under the auspices of the Center for Open Science, a team of 186 researchers from 60 different laboratories (representing 36 different nationalities from six different continents) conducted replications of 28 classic and contemporary findings in psychology. The study's focus was not only whether the original papers' findings replicated but also the extent to which findings varied as a function of variations in samples and contexts. Overall, 50% of the 28 findings failed to replicate despite massive sample sizes. But if a finding replicated, then it replicated in most samples. If a finding was not replicated, then it failed to replicate with little variation across samples and contexts. This evidence is inconsistent with a proposed explanation that failures to replicate in psychology are likely due to changes in the sample between the original and replication study.

Results of a 2022 study suggest that many earlier brain–phenotype studies ("brain-wide association studies" (BWAS)) produced invalid conclusions as the replication of such studies requires samples from thousands of individuals due to small effect sizes.
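Why small effect sizes push the required sample into the thousands can be illustrated with a standard approximation. The sketch below uses the Fisher z-transformation to estimate the sample size needed to detect a correlation r with a two-sided test at a given power. The correlation values are illustrative assumptions only, not estimates taken from the 2022 study, and the function name is hypothetical.

```python
from math import atanh, ceil
from statistics import NormalDist


def required_n_for_correlation(r, alpha=0.05, power=0.8):
    """Approximate n needed to detect a true correlation r (two-sided test).

    Uses the Fisher z approximation: n ~ ((z_{1-alpha/2} + z_{power}) / atanh(r))^2 + 3.
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)
    z_power = std_normal.inv_cdf(power)
    return ceil(((z_alpha + z_power) / atanh(r)) ** 2 + 3)


# Illustrative correlations, from large to very small.
for r in (0.5, 0.3, 0.1, 0.05):
    print(f"r={r}: roughly n={required_n_for_correlation(r)} participants")
```

Because the required sample size grows roughly with the inverse square of the effect size, halving the correlation about quadruples the number of participants needed.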


In medicine

Of 49 medical studies from 1990 to 2003 with more than 1000 citations, 92% found that the studied therapies were effective. Of these studies, 16% were contradicted by subsequent studies, 16% had found stronger effects than did subsequent studies, 44% were replicated, and 24% remained largely unchallenged.

A 2011 analysis by researchers with pharmaceutical company Bayer found that, at most, a quarter of Bayer's in-house findings replicated the original results. But the analysis of Bayer's results found that the results that did replicate could often be successfully used for clinical applications.

In a 2012 paper, C. Glenn Begley, a biotech consultant working at Amgen, and Lee Ellis, a medical researcher at the University of Texas, found that only 11% of 53 pre-clinical cancer studies had replications that could confirm conclusions from the original studies. In late 2021, The Reproducibility Project: Cancer Biology examined 53 top papers about cancer published between 2010 and 2012 and showed that among studies that provided sufficient information to be redone, the effect sizes were 85% smaller on average than the original findings. A survey of cancer researchers found that half of them had been unable to reproduce a published result. Another report estimated that almost half of randomized controlled trials contained flawed data (based on the analysis of anonymized individual participant data (IPD) from more than 150 trials).


In other disciplines


In nutrition science

In nutrition science, for most food ingredients, there were studies that found that the ingredient has an effect on cancer risk. Specifically, out of a random sample of 50 ingredients from a cookbook, 80% had articles reporting on their cancer risk. Statistical significance decreased for meta-analyses.


In economics

Economics has lagged behind other social sciences and psychology in its attempts to assess replication rates and increase the number of studies that attempt replication. A 2016 study in the journal ''Science'' replicated 18 experimental studies published in two leading economics journals, ''The American Economic Review'' and the ''Quarterly Journal of Economics'', between 2011 and 2014. It found that about 39% failed to reproduce the original results. About 20% of studies published in ''The American Economic Review'' are contradicted by other studies despite relying on the same or similar data sets. A study of empirical findings in the ''Strategic Management Journal'' found that about 30% of 27 retested articles showed statistically insignificant results for previously significant findings, whereas about 4% showed statistically significant results for previously insignificant findings.


In water resource management

A 2019 study in ''Scientific Data'' estimated with 95% confidence that of 1,989 articles on water resources and management published in 2017, study results might be reproduced for only 0.6% to 6.8%, largely because the articles did not provide sufficient information to allow for replication.


Across fields

A 2016 survey by ''Nature'' of 1,576 researchers who took a brief online questionnaire on reproducibility found that more than 70% of researchers had tried and failed to reproduce another scientist's experimental results (including 87% of chemists, 77% of biologists, 69% of physicists and engineers, 67% of medical researchers, 64% of earth and environmental scientists, and 62% of all others), and more than half had failed to reproduce their own experiments. But fewer than 20% had been contacted by another researcher unable to reproduce their work. The survey found that fewer than 31% of researchers believe that failure to reproduce results means that the original result is probably wrong, although 52% agree that a significant replication crisis exists. Most researchers said they still trust the published literature. In 2010, Fanelli found that 91.5% of psychiatry/psychology studies confirmed the effects they were looking for, and concluded that the odds of this happening (a positive result) were around five times higher than in fields such as astronomy or geosciences. Fanelli argued that this is because researchers in "softer" sciences have fewer constraints on their conscious and unconscious biases. Early analysis of result-blind peer review, which is less affected by publication bias, has estimated that 61% of result-blind studies in biomedicine and psychology have led to null results, in contrast to an estimated 5% to 20% in earlier research. In 2021, a study conducted by the University of California, San Diego found that papers that cannot be replicated are more likely to be cited. Nonreplicable publications are often cited more even after a replication study is published.


Causes

There are many proposed causes for the replication crisis.


Historical and sociological causes

The replication crisis may be triggered by the "generation of new data and scientific publications at an unprecedented rate" that leads to "desperation to publish or perish" and failure to adhere to good scientific practice. Predictions of an impending crisis in the quality-control mechanism of science can be traced back several decades. Derek de Solla Price, considered the father of scientometrics (the quantitative study of science), predicted in 1963 that science could reach "senility" as a result of its own exponential growth. Some present-day literature seems to vindicate this "overflow" prophecy, lamenting the decay in both attention and quality. Historian Philip Mirowski argues that the decline of scientific quality can be connected to its commodification, especially spurred by major corporations' profit-driven decision to outsource their research to universities and contract research organizations. Social systems theory, as expounded in the work of German sociologist Niklas Luhmann, inspires a similar diagnosis. This theory holds that each system, such as economy, science, religion, and media, communicates using its own code: ''true'' and ''false'' for science, ''profit'' and ''loss'' for the economy, ''news'' and ''no-news'' for the media, and so on. According to some sociologists, science's mediatization, commodification, and politicization, as a result of the structural coupling among systems, have led to a confusion of the original system codes.


Problems with the publication system in science


Publication bias

A major cause of low reproducibility is the
publication bias
stemming from the fact that statistically non-significant results and seemingly unoriginal replications are rarely published. Only a very small proportion of academic journals in psychology and neurosciences explicitly welcomed submissions of replication studies in their aim and scope or instructions to authors. This does not encourage reporting on, or even attempts to perform, replication studies. Among 1,576 researchers ''Nature'' surveyed in 2016, only a minority had ever attempted to publish a replication, and several respondents who had published failed replications noted that editors and reviewers demanded that they play down comparisons with the original studies. An analysis of 4,270 empirical studies in 18 business journals from 1970 to 1991 reported that less than 10% of accounting, economics, and finance articles and 5% of management and marketing articles were replication studies. Publication bias is augmented by the pressure to publish and the author's own
confirmation bias
, and is an inherent hazard in the field, requiring a certain degree of skepticism on the part of readers. Publication bias leads to what psychologist Robert Rosenthal calls the "file drawer effect". The file drawer effect is the idea that as a consequence of the publication bias, a significant number of negative results are not published. According to philosopher of science Felipe Romero, this tends to produce "misleading literature and biased meta-analytic studies", and when publication bias is considered along with the fact that a majority of tested hypotheses might be false ''a priori'', it is plausible that a considerable proportion of research findings might be false positives, as shown by metascientist John Ioannidis. In turn, a high proportion of false positives in the published literature can explain why many findings are nonreproducible. Another publication bias is that studies that do not reject the null hypothesis are scrutinized asymmetrically. For example, they are likely to be rejected as being difficult to interpret or having a Type II error. Studies that do reject the null hypothesis are not likely to be rejected for those reasons. In popular media, there is another element of publication bias: the desire to make research accessible to the public leads to oversimplification and exaggeration of findings, creating unrealistic expectations and amplifying the impact of non-replications. In contrast, null results and failures to replicate tend to go unreported. This explanation may apply to power posing's replication crisis.


Mathematical errors

Even high-impact journals have a significant fraction of mathematical errors in their use of statistics. For example, 11% of statistical results published in ''Nature'' and ''BMJ'' in 2001 are "incongruent", meaning that the reported p-value is mathematically different from what it should be if it were correctly calculated from the reported test statistic. These errors were likely from typesetting, rounding, and transcription errors. Among 157 neuroscience papers published in five top-ranking journals that attempt to show that two experimental effects are different, 78 erroneously tested instead for whether one effect is significant while the other is not, and 79 correctly tested for whether their difference is significantly different from 0.
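The congruence check described above can be sketched in a few lines. This is only an illustration under a simplifying assumption: the reported test statistic is treated as a standard-normal z value, whereas published results typically report t, F, or chi-squared statistics. The function names `two_sided_p_from_z` and `is_congruent` are hypothetical and not taken from any published tool.

```python
import math


def two_sided_p_from_z(z: float) -> float:
    """Two-sided p-value for a standard-normal test statistic z."""
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for the normal CDF Phi.
    return math.erfc(abs(z) / math.sqrt(2.0))


def is_congruent(z: float, reported_p: float, decimals: int = 2) -> bool:
    """Check whether a reported p-value agrees with the recomputed one.

    Both values are rounded to the precision at which the p-value was
    reported (an assumption of this sketch), then compared.
    """
    recomputed = two_sided_p_from_z(z)
    return abs(round(recomputed, decimals) - round(reported_p, decimals)) < 1e-12


if __name__ == "__main__":
    # z = 2.20 gives p of roughly 0.028, so "p = 0.03" is congruent
    # and "p = 0.01" is not.
    print(is_congruent(2.20, 0.03))
    print(is_congruent(2.20, 0.01))
```

A check of this kind flags only internal inconsistencies between a statistic and its p-value; it cannot say which of the two numbers was misreported.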


"Publish or perish" culture

The consequences for replicability of the publication bias are exacerbated by academia's "publish or perish" culture. As explained by metascientist Daniele Fanelli, "publish or perish" culture is a sociological aspect of academia whereby scientists work in an environment with very high pressure to have their work published in recognized journals. This is the consequence of the academic work environment being hypercompetitive and of bibliometric parameters (e.g., number of publications) being increasingly used to evaluate scientific careers. According to Fanelli, this pushes scientists to employ a number of strategies aimed at making results "publishable". In the context of publication bias, this can mean adopting behaviors aimed at making results positive or statistically significant, often at the expense of their validity (see QRPs, section 4.3). According to Center for Open Science founder Brian Nosek and his colleagues, "publish or perish" culture created a situation whereby the goals and values of single scientists (e.g., publishability) are not aligned with the general goals of science (e.g., pursuing scientific truth). This is detrimental to the validity of published findings. Philosopher Brian D. Earp and psychologist Jim A. C. Everett argue that, although replication is in the best interests of academics and researchers as a group, features of academic psychological culture discourage replication by individual researchers. They argue that performing replications can be time-consuming, and take away resources from projects that reflect the researcher's original thinking. They are harder to publish, largely because they are unoriginal, and even when they can be published they are unlikely to be viewed as major contributions to the field. Replications "bring less recognition and reward, including grant money, to their authors". In his 1971 book '' Scientific Knowledge and Its Social Problems'', philosopher and historian of science Jerome R. 
Ravetz predicted that science, in its progression from "little" science composed of isolated communities of researchers to "big" science or "techno-science", would suffer major problems in its internal system of quality control. He recognized that the incentive structure for modern scientists could become dysfunctional, creating perverse incentives to publish any findings, however dubious. According to Ravetz, quality in science is maintained only when there is a community of scholars, linked by a set of shared norms and standards, who are willing and able to hold each other accountable.


Standards of reporting

Certain publishing practices also make it difficult to conduct replications and to monitor the severity of the reproducibility crisis, for articles often come with insufficient descriptions for other scholars to reproduce the study. The Reproducibility Project: Cancer Biology showed that of 193 experiments from 53 top papers about cancer published between 2010 and 2012, only 50 experiments from 23 papers had authors who provided enough information for researchers to redo the studies, sometimes with modifications. None of the 193 experiments examined had its experimental protocols fully described, and replicating 70% of experiments required asking for key reagents. The aforementioned study of empirical findings in the ''Strategic Management Journal'' found that 70% of 88 articles could not be replicated due to a lack of sufficient information for data or procedures. In water resources and management, most of 1,989 articles published in 2017 were not replicable because of a lack of available information shared online. In studies of event-related potentials, only two-thirds of the information needed to replicate a study was reported in a sample of 150 studies, highlighting that there are substantial gaps in reporting.


Procedural bias

By the Duhem-Quine thesis, scientific results are interpreted by both a substantive theory and a theory of instruments. For example, astronomical observations depend both on the theory of astronomical objects and the theory of telescopes. A large amount of non-replicable research might accumulate if there is a bias of the following kind: faced with a null result, a scientist prefers to treat the data as saying the instrument is insufficient; faced with a non-null result, a scientist prefers to accept the instrument as good, and treat the data as saying something about the substantive theory.


Cultural evolution

Smaldino and McElreath proposed a simple model for the cultural evolution of scientific practice. Each lab randomly decides to produce novel research or replication research, at different fixed levels of false positive rate, true positive rate, replication rate, and productivity (its "traits"). A lab might use more "effort", making the ROC curve more convex but decreasing productivity. A lab accumulates a score over its lifetime that increases with publications and decreases when another lab fails to replicate its results. At regular intervals, a random lab "dies" and another "reproduces" a child lab with traits similar to its parent's. Labs with higher scores are more likely to reproduce. Under certain parameter settings, the population of labs converges to maximum productivity even at the price of very high false positive rates.
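The selection dynamic described above can be illustrated with a deliberately stripped-down loop. This is not the Smaldino and McElreath model: it keeps a single trait ("effort"), assumes productivity falls as effort rises, and leaves out replication studies and the score penalty for failed replications. All names, parameter values, and the linear false-positive trade-off are assumptions made for illustration.

```python
import random


def simulate(n_labs=50, steps=3000, tournament=10, mutation_sd=0.05, seed=0):
    """Evolve lab 'effort' under selection for productivity alone."""
    rng = random.Random(seed)
    effort = [0.75] * n_labs  # every lab starts with the same effort in [0, 1]
    for _ in range(steps):
        # Assumed payoff: lower effort means more publications, so within a
        # random subset the lowest-effort lab has the highest score.
        candidates = rng.sample(range(n_labs), tournament)
        parent = min(candidates, key=lambda i: effort[i])
        # The child inherits the parent's effort with a small mutation.
        child = min(1.0, max(0.0, effort[parent] + rng.gauss(0.0, mutation_sd)))
        # A random lab "dies" and is replaced by the child.
        effort[rng.randrange(n_labs)] = child
    return effort


def false_positive_rate(e, base=0.05, worst=0.5):
    """Assumed linear trade-off: full effort gives `base`, zero effort gives `worst`."""
    return worst - (worst - base) * e


if __name__ == "__main__":
    final = simulate()
    mean_effort = sum(final) / len(final)
    print("mean effort:", round(mean_effort, 3))
    print("implied false positive rate:", round(false_positive_rate(mean_effort), 3))
```

Because nothing in this sketch rewards careful work, effort drifts toward zero; the full model adds replication and reputational costs to examine when that outcome can be avoided.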


Questionable research practices and fraud

Questionable research practices (QRPs) are intentional behaviors that capitalize on the gray area of acceptable scientific behavior or exploit the researcher degrees of freedom (researcher DF), which can contribute to the irreproducibility of results by increasing the probability of false positive results. Researcher DF are seen in hypothesis formulation, design of experiments, data collection and analysis, and reporting of research. In many-analyst studies across psychology, linguistics, and ecology, in which several researchers or research teams analyze the same data, analysts obtain different and sometimes conflicting results even without incentives to report statistically significant findings. This is because research design and data analysis entail numerous decisions that are not sufficiently constrained by a field’s best practices and statistical methodologies. As a result, researcher DF can lead to situations where some failed replication attempts use a different, yet plausible, research design or statistical analysis; such studies do not necessarily undermine previous findings. Multiverse analysis, a method that makes inferences based on all plausible data-processing pipelines, provides a solution to the problem of analytical flexibility. By contrast, estimating many statistical models (known as data dredging), selectively reporting only statistically significant findings, and HARKing (hypothesizing after results are known) are examples of questionable research practices. In medicine, irreproducible studies have six features in common: investigators not being blinded to the experimental versus the control arms; failure to repeat experiments; lack of positive and negative controls; failing to report all the data; inappropriate use of statistical tests; and use of reagents that were not appropriately validated. QRPs do not include more explicit violations of scientific integrity, such as data falsification. Fraudulent research does occur, as in the case of scientific fraud by social psychologist Diederik Stapel, cognitive psychologist Marc Hauser and social psychologist Lawrence Sanna, but it appears to be uncommon.


Prevalence

According to IU professor Ernest O’Boyle and psychologist Martin Götz, around 50% of researchers surveyed across various studies admitted engaging in HARKing. In a survey of 2,000 psychologists by behavioral scientist Leslie K. John and colleagues, around 94% of psychologists admitted having employed at least one QRP. More specifically, 63% admitted failing to report all of a study's dependent measures, 28% admitted failing to report all of a study's conditions, and 46% admitted selectively reporting studies that produced the desired pattern of results. In addition, 56% admitted having collected more data after having inspected already collected data, and 16% admitted having stopped data collection because the desired result was already visible. According to biotechnology researcher J. Leslie Glick's estimate in 1992, 10% to 20% of research and development studies involved either QRPs or outright fraud. The methodology used to estimate QRPs has been contested, and more recent studies suggested lower prevalence rates on average. A 2009 meta-analysis found that 2% of scientists across fields admitted falsifying studies at least once and 14% admitted knowing someone who did. Such misconduct was, according to one study, reported more frequently by medical researchers than by others.


Statistical issues


Low statistical power

According to Deakin University professor Tom Stanley and colleagues, one plausible reason studies fail to replicate is low statistical power. This happens for three reasons. First, a replication study with low power is unlikely to succeed since, by definition, it has a low probability of detecting a true effect. Second, if the original study has low power, it will yield biased effect size estimates. When conducting a priori power analysis for the replication study, this will result in underestimation of the required sample size. Third, if the original study has low power, the post-study odds of a statistically significant finding reflecting a true effect are quite low. It is therefore likely that a replication attempt of the original study would fail. Mathematically, the probability of replicating a previous publication that rejected a null hypothesis $H_0$ in favor of an alternative $H_1$ is $(\text{significance}) \Pr(H_0 \mid \text{publication}) + (\text{power}) \Pr(H_1 \mid \text{publication}) \leq (\text{power})$, assuming significance is less than power. Thus, low power implies low probability of replication, regardless of how the previous publication was designed, and regardless of which hypothesis is really true. Stanley and colleagues estimated the average statistical power of psychological literature by analyzing data from 200
meta-analyses. They found that on average, psychology studies have between 33.1% and 36.4% statistical power. These values are quite low compared to the 80% considered adequate statistical power for an experiment. Across the 200 meta-analyses, the median proportion of studies with adequate statistical power was between 7.7% and 9.1%, implying that a positive result would replicate with probability less than 10%, regardless of whether the positive result was a true positive or a false positive. The statistical power of neuroscience studies is quite low. The estimated statistical power of fMRI research is between .08 and .31, and that of studies of event-related potentials was estimated as .72‒.98 for large effect sizes, .35‒.73 for medium effects, and .10‒.18 for small effects. In a study published in ''Nature'', psychologist Katherine Button and colleagues conducted a similar study with 49 meta-analyses in neuroscience, estimating a median statistical power of 21%. Meta-scientist John Ioannidis and colleagues computed an estimate of average power for empirical economic research, finding a median power of 18% based on literature drawing upon 6,700 studies. In light of these results, it is plausible that a major reason for widespread failures to replicate in several scientific fields might be very low statistical power on average. The same statistical test with the same significance level will have lower statistical power if the effect size is small under the alternative hypothesis. Complex inheritable traits are typically correlated with a large number of genes, each of small effect size, so high power requires a large sample size. In particular, many results from the candidate gene literature suffered from small effect sizes and small sample sizes and would not replicate. More data from genome-wide association studies (GWAS) come close to solving this problem. As a numeric example, most genes associated with schizophrenia risk have low effect size (genotypic relative risk, GRR). A statistical study with 1000 cases and 1000 controls has 0.03% power for a gene with GRR = 1.15, which is already large for schizophrenia. In contrast, the largest GWAS to date has ~100% power for it.
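The bound on the probability of replication given earlier in this section can be made concrete with a small calculation. The sketch below rests on assumptions not made in the text: a result is published exactly when it is statistically significant, the replication uses the same significance level and power as the original study, and `prior_h1` is a hypothetical prior probability that the alternative hypothesis is true.

```python
def replication_probability(alpha: float, power: float, prior_h1: float) -> float:
    """Probability that a same-design replication is significant,
    given that the original study was published as significant."""
    # Probability that the original study came out significant at all.
    p_significant = power * prior_h1 + alpha * (1.0 - prior_h1)
    # Bayes' rule: posterior probability of each hypothesis given publication.
    post_h1 = power * prior_h1 / p_significant
    post_h0 = 1.0 - post_h1
    # significance * Pr(H0 | publication) + power * Pr(H1 | publication)
    return alpha * post_h0 + power * post_h1


if __name__ == "__main__":
    # With 35% power, the replication probability cannot exceed 0.35,
    # whatever the prior is.
    for prior in (0.1, 0.5, 0.9):
        print(prior, round(replication_probability(0.05, 0.35, prior), 4))
```

With a significance level of 0.05, power of 0.35, and a prior of 0.5, the result is 0.3125, just under the power, which illustrates why low power caps the replication rate even when most tested effects are real.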


Positive effect size bias

Even when a study replicates, the replication typically has a smaller effect size. Underpowered studies have a large effect size bias. In studies that statistically estimate a regression factor, such as the k in Y = kX + b, when the dataset is large, noise tends to cause the regression factor to be underestimated, but when the dataset is small, noise tends to cause the regression factor to be overestimated.
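The effect size bias of underpowered studies can be illustrated with a small simulation. It rests on assumptions chosen for simplicity rather than drawn from the text: each study estimates a mean from normally distributed data with known standard deviation 1, uses a two-sided z-test at the 0.05 level, and only significant estimates are kept, as if only they were published. The function name and default values are hypothetical.

```python
import random


def significant_estimates(true_effect=0.2, n=20, studies=5000, seed=0):
    """Return the effect estimates of simulated studies that reach significance."""
    rng = random.Random(seed)
    z_crit = 1.96  # two-sided 5% critical value for a z-test
    kept = []
    for _ in range(studies):
        sample = [rng.gauss(true_effect, 1.0) for _ in range(n)]
        estimate = sum(sample) / n
        z = estimate * n ** 0.5  # standard error is 1 / sqrt(n) when sd = 1
        if abs(z) > z_crit:
            kept.append(estimate)
    return kept


if __name__ == "__main__":
    kept = significant_estimates()
    print("share significant:", len(kept) / 5000)
    print("mean significant estimate:", sum(kept) / len(kept))
    print("true effect:", 0.2)
```

Under these settings a positive estimate must exceed about 0.44 to be significant, more than twice the true effect of 0.2, so the surviving estimates are inflated by construction. A replication that estimates the effect without this filter would tend to find a smaller value.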


Problems of meta-analysis

Meta-analyses have their own methodological problems and disputes, which lead to rejection of the meta-analytic method by researchers whose theory is challenged by meta-analysis. Rosenthal proposed the "fail-safe number" (FSN) to avoid the publication bias against null results. It is defined as follows: Suppose the null hypothesis is true; how many publications would be required to make the current result indistinguishable from the null hypothesis? Rosenthal's point is that certain effect sizes are large enough that, even if there is a total publication bias against null results (the "file drawer problem"), the number of unpublished null results needed to swamp out the effect size would be impossibly large. Thus, the effect size must be statistically significant even after accounting for unpublished null results. One objection to the FSN is that it is calculated as if unpublished results are unbiased samples from the null hypothesis. But if the file drawer problem is real, then unpublished results would have effect sizes concentrated around 0. Thus fewer unpublished null results would be necessary to swamp out the effect size, and so the FSN is an overestimate. Another problem with meta-analysis is that bad studies are "infectious" in the sense that one bad study might cause the entire meta-analysis to overestimate statistical significance.
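The fail-safe number can be computed with the closed form usually attributed to Rosenthal, which the text above does not state. The sketch assumes that study results are combined with Stouffer's method (summed z-scores divided by the square root of the number of studies), that the unpublished studies average a z-score of zero, and that the threshold is the one-tailed 0.05 critical value of 1.645. The function name is hypothetical.

```python
import math


def fail_safe_n(z_scores, z_alpha=1.645):
    """Number of unpublished null studies (average z = 0) needed to pull the
    combined Stouffer z-score of the published studies down to z_alpha."""
    k = len(z_scores)
    total = sum(z_scores)
    # Solve total / sqrt(k + N) = z_alpha for N.
    return max(0.0, (total / z_alpha) ** 2 - k)


if __name__ == "__main__":
    published = [2.0] * 10  # ten published studies, each with z = 2.0
    n = fail_safe_n(published)
    print("fail-safe N:", round(n, 1))
    # Check: adding N null studies brings the combined z down to 1.645.
    print("combined z:", round(sum(published) / math.sqrt(len(published) + n), 3))
```

The assumption that unpublished studies average exactly zero is the one the objection above targets, so the number produced here depends directly on it.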


P-hacking

Various statistical methods can be applied to make the p-value appear smaller than it really is. This need not be malicious, as moderately flexible data analysis, routine in research, can increase the false-positive rate to above 60%. For example, if one collects some data, applies several different significance tests to it, and publishes only the one that happens to have a p-value less than 0.05, then the total p-value for "at least one significance test reaches p < 0.05" can be much larger than 0.05, because even if the null hypothesis were true, the probability that one out of many significance tests is extreme is not itself extreme. Typically, a statistical study has multiple steps, with several choices at each step, such as during data collection, outlier rejection, choice of test statistic, choice of one-tailed or two-tailed test, etc. These choices in the "garden of forking paths" multiply, creating many "researcher degrees of freedom". The effect is similar to the file-drawer problem, as the paths not taken are not published. Consider a simple illustration. Suppose the null hypothesis is true, and we have 20 possible significance tests to apply to the dataset. Also suppose the outcomes of the significance tests are independent. By definition of "significance", each test has probability 0.05 of passing at significance level 0.05. The probability that at least 1 out of 20 is significant is, by assumption of independence, $1 - (1 - 0.05)^{20} \approx 0.64$. Another possibility is the
multiple comparisons problem
. In 2009, it was twice noted that fMRI studies had a suspicious number of positive results with large effect sizes, more than would be expected given the studies' low power (one example had only 13 subjects). It was pointed out that over half of the studies tested for correlation between a phenomenon and individual fMRI voxels, and reported only on voxels exceeding chosen thresholds.

Optional stopping is a practice where one collects data until some stopping criterion is reached. Though a valid procedure, it is easily misused. The problem is that the correct p-value of an optionally stopped statistical test is larger than it seems. Intuitively, this is because the p-value is supposed to be the total probability of all events at least as rare as what is observed. With optional stopping, there are even rarer events that are difficult to account for, i.e. not triggering the optional stopping rule and collecting even more data before stopping. Neglecting these events leads to a p-value that is too low. In fact, if the null hypothesis is true, ''any'' significance level can be reached if one is allowed to keep collecting data and stop when the desired p-value (calculated as if one had always planned to collect exactly this much data) is obtained. For a concrete example of testing for a fair coin, see the optional stopping section of the ''p''-value article. More succinctly, the proper calculation of a p-value requires accounting for counterfactuals, that is, what the experimenter ''could'' have done in reaction to data that ''might'' have been. Accounting for what might have been is hard even for honest researchers. One benefit of preregistration is that it fixes these counterfactuals in advance, allowing the p-value to be calculated correctly. The problem of early stopping is not limited to researcher misconduct. There is often pressure to stop early if the cost of collecting data is high. Some animal ethics boards even mandate early stopping if the study obtains a significant result midway.
Such practices are widespread in psychology. In a 2012 survey, 56% of psychologists admitted to early stopping, 46% to only reporting analyses that "worked", and 38% to ''post hoc'' exclusion, that is, removing some data ''after'' analysis was already performed on the data before reanalyzing the remaining data (often on the premise of "outlier removal").
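
The two mechanisms described in this section can be checked numerically. The following sketch first evaluates the "at least one of 20 independent tests" probability and then simulates optional stopping under a true null hypothesis. The simulation settings (standard normal data with known variance, a look after every observation from 10 to 200, a two-sided 5% threshold) are assumptions chosen for illustration only.

```python
import math
import random


def familywise_error_rate(num_tests, alpha=0.05):
    """Chance that at least one of num_tests independent true-null tests is significant."""
    return 1 - (1 - alpha) ** num_tests


def optional_stopping_false_positive_rate(
    num_experiments=2000, min_n=10, max_n=200, z_crit=1.96, seed=0
):
    """Simulate peeking after every observation while the null hypothesis is true.

    Data are standard normal with known variance, so the z statistic for
    "mean = 0" after n observations is sum / sqrt(n).
    """
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(num_experiments):
        total = 0.0
        for n in range(1, max_n + 1):
            total += rng.gauss(0.0, 1.0)
            if n >= min_n and abs(total) / math.sqrt(n) > z_crit:
                false_positives += 1  # stopped early and declared significance
                break
    return false_positives / num_experiments


fwer_20 = familywise_error_rate(20)
peeking_rate = optional_stopping_false_positive_rate()
print(f"20 independent tests: {fwer_20:.2f}")
print(f"optional stopping:    {peeking_rate:.2f}")
```

In both cases each individual test nominally has a 5% false-positive rate, yet the chance of reporting some "significant" result is several times higher.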


Statistical heterogeneity

As also reported by Stanley and colleagues, a further reason studies might fail to replicate is high
heterogeneity
of the to-be-replicated effects. In meta-analysis, "heterogeneity" refers to the variance in research findings that results from there being no single true effect size. Instead, findings in such cases are better seen as a distribution of true effects. Statistical heterogeneity is calculated using the I-squared statistic, defined as "the proportion (or percentage) of observed variation among reported effect sizes that cannot be explained by the calculated standard errors associated with these reported effect sizes". This variation can be due to differences in experimental methods, populations, cohorts, and statistical methods between replication studies. Heterogeneity poses a challenge to studies attempting to replicate previously found
effect sizes. When heterogeneity is high, subsequent replications have a high probability of finding an effect size radically different from that of the original study. Importantly, significant levels of heterogeneity are also found in direct/exact replications of a study. Stanley and colleagues discuss this while reporting a study by quantitative behavioral scientist Richard Klein and colleagues, where the authors attempted to replicate 15 psychological effects across 36 different sites in Europe and the U.S. In the study, Klein and colleagues found significant amounts of heterogeneity in 8 out of 16 effects (I-squared = 23% to 91%). Importantly, while the replication sites intentionally differed on a variety of characteristics, such differences could account for very little heterogeneity. According to Stanley and colleagues, this suggested that heterogeneity could have been a genuine characteristic of the phenomena being investigated. For instance, phenomena might be influenced by so-called "hidden moderators", that is, relevant factors that were previously not understood to be important in the production of a certain effect. In their analysis of 200 meta-analyses of psychological effects, Stanley and colleagues found a median percent of heterogeneity of I-squared = 74%. According to the authors, this level of heterogeneity can be considered "huge". It is three times larger than the random sampling variance of effect sizes measured in their study. If considered alongside sampling error, heterogeneity yields a standard deviation from one study to the next even larger than the median effect size of the 200 meta-analyses they investigated. The authors conclude that if replication is defined by a subsequent study finding a sufficiently similar effect size to the original, replication success is not likely even if replications have very large sample sizes. Importantly, this occurs even if replications are direct or exact, since heterogeneity nonetheless remains relatively high in these cases.
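
To make the I-squared statistic concrete, here is a minimal sketch using the widely used formulation based on Cochran's Q with inverse-variance weights, I-squared = max(0, (Q - df) / Q). It is not taken from Stanley and colleagues' analysis; the function name and the five effect sizes and standard errors are hypothetical.

```python
def i_squared(effects, standard_errors):
    """I-squared from Cochran's Q, using inverse-variance (fixed-effect) weights."""
    weights = [1 / se**2 for se in standard_errors]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    # Q: weighted squared deviations of each study from the pooled estimate.
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    if q <= 0:
        return 0.0
    # Share of observed variation beyond what sampling error alone predicts.
    return max(0.0, (q - df) / q)


# Hypothetical effect sizes and standard errors from five replication sites.
effects = [0.10, 0.45, 0.30, 0.80, -0.05]
standard_errors = [0.10, 0.12, 0.08, 0.15, 0.10]
i2 = i_squared(effects, standard_errors)
print(f"I-squared = {i2:.0%}")
```

Read this way, a median I-squared of 74% means heterogeneity accounts for 74% of observed variation and sampling error for 26%, a ratio of 0.74 / 0.26, or roughly three to one, which matches the "three times larger" comparison above.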


Others

Within economics, the replication crisis may also be exacerbated because econometric results are fragile: using different but plausible estimation procedures or data preprocessing techniques can lead to conflicting results.


Context sensitivity

New York University
professor Jay Van Bavel and colleagues argue that a further reason findings are difficult to replicate is the sensitivity to context of certain psychological effects. On this view, failures to replicate might be explained by contextual differences between the original experiment and the replication, often called "hidden moderators". Van Bavel and colleagues tested the influence of context sensitivity by reanalyzing the data of the widely cited Reproducibility Project carried out by the Open Science Collaboration. They re-coded effects according to their sensitivity to contextual factors and then tested the relationship between context sensitivity and replication success in various
regression models
. Context sensitivity was found to negatively correlate with replication success, such that higher ratings of context sensitivity were associated with lower probabilities of replicating an effect. Importantly, context sensitivity significantly correlated with replication success even when adjusting for other factors considered important for reproducing results (e.g., effect size and sample size of original, statistical power of the replication, methodological similarity between original and replication). In light of the results, the authors concluded that attempting a replication in a different time, place or with a different sample can significantly alter an experiment's results. Context sensitivity thus may be a reason certain effects fail to replicate in psychology.


Bayesian explanation

In the framework of Bayesian probability, by Bayes' theorem, rejecting the null hypothesis at significance level 5% does not mean that the posterior probability for the alternative hypothesis is 95%, and the posterior probability is also different from the probability of replication. Consider a simplified case where there are only two hypotheses. Let the prior probability of the null hypothesis be Pr(H_0), and that of the alternative be Pr(H_1) = 1 - Pr(H_0). For a given statistical study, let its false positive rate (significance level) be Pr(\text{find } H_1 \mid H_0), and its true positive rate (power) be Pr(\text{find } H_1 \mid H_1). For illustrative purposes, let the significance level be 0.05 and the power be 0.45 (underpowered). Now, by Bayes' theorem, conditional on the statistical study finding H_1 to be true, the posterior probability of H_1 actually being true is not 1 - Pr(\text{find } H_1 \mid H_0) = 0.95, but

Pr(H_1 \mid \text{find } H_1) = \frac{Pr(\text{find } H_1 \mid H_1) Pr(H_1)}{Pr(\text{find } H_1 \mid H_0) Pr(H_0) + Pr(\text{find } H_1 \mid H_1) Pr(H_1)}

and the probability of replicating the statistical study is

Pr(\text{replication} \mid \text{find } H_1) = Pr(\text{find } H_1 \mid H_1) Pr(H_1 \mid \text{find } H_1) + Pr(\text{find } H_1 \mid H_0) Pr(H_0 \mid \text{find } H_1)

which is also different from Pr(H_1 \mid \text{find } H_1). In particular, for a fixed level of significance, the probability of replication increases with power and with the prior probability for H_1. If the prior probability for H_1 is small, then one would require a high power for replication. For example, if the prior probability of the null hypothesis is Pr(H_0) = 0.9, and the study found a positive result, then the posterior probability for H_1 is Pr(H_1 \mid \text{find } H_1) = 0.50, and the replication probability is Pr(\text{replication} \mid \text{find } H_1) = 0.25.
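
The worked numbers in this section can be reproduced with a few lines of code. The sketch below implements the two-hypothesis calculation described above; the function and parameter names are illustrative choices, not notation from the source.

```python
def posterior_h1(prior_h0, alpha, power):
    """Pr(H_1 | study finds H_1) in the two-hypothesis setting, via Bayes' theorem."""
    prior_h1 = 1 - prior_h0
    true_positive = power * prior_h1   # Pr(find H_1 | H_1) * Pr(H_1)
    false_positive = alpha * prior_h0  # Pr(find H_1 | H_0) * Pr(H_0)
    return true_positive / (true_positive + false_positive)


def replication_probability(prior_h0, alpha, power):
    """Pr(an identical second study also finds H_1 | the first study found H_1)."""
    post_h1 = posterior_h1(prior_h0, alpha, power)
    return power * post_h1 + alpha * (1 - post_h1)


# Values from the example: Pr(H_0) = 0.9, significance level 0.05, power 0.45.
post = posterior_h1(prior_h0=0.9, alpha=0.05, power=0.45)
rep = replication_probability(prior_h0=0.9, alpha=0.05, power=0.45)
print(f"posterior Pr(H_1): {post:.2f}")
print(f"replication prob.: {rep:.2f}")
```

Varying `power` or `prior_h0` in these calls shows the monotonic behavior described above: replication probability rises with power and with the prior probability of H_1.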


Problem with null hypothesis testing

Some argue that null hypothesis testing is itself inappropriate, especially in "soft sciences" like social psychology. As repeatedly observed by statisticians, in complex systems, such as social psychology, "the null hypothesis is always false", or "everything is correlated". If so, then a failure to reject the null hypothesis does not show that the null hypothesis is true, but merely that the result was a false negative, typically due to low power. Low power is especially prevalent in subject areas where effect sizes are small and data are expensive to acquire, such as social psychology. Furthermore, when the null hypothesis is rejected, this might not be evidence for the substantive alternative hypothesis. In soft sciences, many hypotheses can predict a correlation between two variables. Thus, evidence ''against'' the null hypothesis "there is no correlation" is no evidence ''for'' one of the many alternative hypotheses that equally well predict "there is a correlation". Fisher developed null hypothesis significance testing (NHST) for agronomy, where rejecting the null hypothesis is usually good proof of the alternative hypothesis, since there are not many alternatives. Rejecting the hypothesis "fertilizer does not help" is evidence for "fertilizer helps". But in psychology, there are many alternative hypotheses for every null hypothesis.
Paul Meehl (1986). "What social scientists don't understand". In D. W. Fiske & R. A. Shweder (Eds.), ''Metatheory in social science: Pluralisms and subjectivities'' (pp. 315-338). Chicago: University of Chicago Press.
In particular, when statistical studies on extrasensory perception reject the null hypothesis at extremely low p-value (as in the case of
Daryl Bem
), it does not imply the alternative hypothesis "ESP exists". Far more likely is that there was a small (non-ESP) signal in the experiment setup that has been measured precisely.
Paul Meehl
noted that statistical hypothesis testing is used differently in "soft" psychology (personality, social, etc.) from physics. In physics, a theory makes a quantitative prediction and is tested by checking whether the prediction falls within the statistically measured interval. In soft psychology, a theory makes a directional prediction and is tested by checking whether the null hypothesis is rejected in the right direction. Consequently, improved experimental technique makes theories more likely to be falsified in physics but less likely to be falsified in soft psychology, as the null hypothesis is always false since any two variables are correlated by a "crud factor" of about 0.30. The net effect is an accumulation of theories that remain unfalsified, but with no empirical evidence for preferring one over the others.


Base rate fallacy

According to philosopher Alexander Bird, a possible reason for the low rates of replicability in certain scientific fields is that a majority of tested hypotheses are false ''a priori''. On this view, low rates of replicability could be consistent with quality science. Relatedly, the expectation that most findings should replicate would be misguided and, according to Bird, a form of base rate fallacy. Bird's argument works as follows. Assuming an ideal situation of a test of significance, whereby the probability of incorrectly rejecting the null hypothesis is 5% (i.e.
Type I error
) and the probability of correctly rejecting the null hypothesis is 80% (i.e. power), in a context where a high proportion of tested hypotheses are false, it is conceivable that the number of false positives would be high compared to the number of true positives. For example, in a situation where only 10% of tested hypotheses are actually true, one can calculate that as many as 36% of positive results will be false positives. The claim that the falsity of most tested hypotheses can explain low rates of replicability is even more relevant when considering that the average power for statistical tests in certain fields might be much lower than 80%. For example, the proportion of false positives increases to a value between 55.2% and 57.6% when calculated with the estimates of an average power between 34.1% and 36.4% for psychology studies, as provided by Stanley and colleagues in their analysis of 200 meta-analyses in the field. A high proportion of false positives would then result in many research findings being non-replicable. Bird notes that the claim that a majority of tested hypotheses are false ''a priori'' in certain scientific fields might be plausible given factors such as the complexity of the phenomena under investigation, the fact that theories are seldom undisputed, the "inferential distance" between theories and hypotheses, and the ease with which hypotheses can be generated. In this respect, the fields Bird takes as examples are clinical medicine, genetic and molecular epidemiology, and social psychology. The situation is radically different in fields where theories have an outstanding empirical basis and hypotheses can be easily derived from theories (e.g., experimental physics).
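
The 36% figure follows from simple bookkeeping over true and false positives. The sketch below is one way to reproduce it, assuming the idealized setting in the text (a fixed share of true hypotheses, a 5% Type I error rate, and a single common power); the function name is an illustrative choice, and the second call uses a round power of 35% rather than the exact estimates quoted above.

```python
def false_positive_share(share_true, alpha, power):
    """Fraction of all positive (significant) results that are false positives."""
    true_positives = share_true * power          # true hypotheses correctly detected
    false_positives = (1 - share_true) * alpha   # false hypotheses wrongly "detected"
    return false_positives / (true_positives + false_positives)


# Example from the text: 10% of tested hypotheses true, alpha = 5%, power = 80%.
share_at_80 = false_positive_share(share_true=0.10, alpha=0.05, power=0.80)
# Same base rate with a much lower, assumed power of 35%.
share_at_35 = false_positive_share(share_true=0.10, alpha=0.05, power=0.35)
print(f"power 80%: {share_at_80:.0%} of positives are false")
print(f"power 35%: {share_at_35:.0%} of positives are false")
```

With low power, more than half of all positive results are false positives under these assumptions, even though every individual test is conducted correctly.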


Consequences

When effects are wrongly stated as relevant in the literature, failure to detect this by replication will lead to the canonization of such false facts. A 2021 study found that papers in leading general interest, psychology and economics journals with findings that could not be replicated tend to be cited more over time than reproducible research papers, likely because these results are surprising or interesting. The trend is not affected by publication of failed reproductions, after which only 12% of papers that cite the original research will mention the failed replication. Further, experts are able to predict which studies will be replicable, leading the authors of the 2021 study, Marta Serra-Garcia and
Uri Gneezy
, to conclude that experts apply lower standards to interesting results when deciding whether to publish them.


Public awareness and perceptions

Concerns have been expressed within the scientific community that the general public may consider science less credible due to failed replications. Research supporting this concern is sparse, but a nationally representative survey in Germany showed that more than 75% of Germans have not heard of replication failures in science. The study also found that most Germans have positive perceptions of replication efforts: only 18% think that non-replicability shows that science cannot be trusted, while 65% think that replication research shows that science applies quality control, and 80% agree that errors and corrections are part of science.


Response in academia

With the replication crisis of psychology earning attention, Princeton University psychologist Susan Fiske drew controversy for speaking against critics of psychology for what she called bullying and undermining the science. She called these unidentified "adversaries" names such as "methodological terrorist" and "self-appointed data police", saying that criticism of psychology should be expressed only in private or by contacting the journals. Columbia University statistician and political scientist Andrew Gelman responded to Fiske, saying that she had found herself willing to tolerate the "dead paradigm" of faulty statistics and had refused to retract publications even when errors were pointed out. He added that her tenure as editor had been abysmal and that a number of published papers she edited were found to be based on extremely weak statistics; one of Fiske's own published papers had a major statistical error and "impossible" conclusions.


Credibility revolution

Some researchers in psychology indicate that the replication crisis is a foundation for a "credibility revolution", in which changes to the standards by which psychological science is evaluated may include emphasizing transparency and openness, preregistering research projects, and replicating research with higher standards for evidence to improve the strength of scientific claims. Such changes may diminish the productivity of individual researchers, but this effect could be avoided by data sharing and greater collaboration. A credibility revolution could be good for the research environment.


Remedies

Focus on the replication crisis has led to renewed efforts in psychology to retest important findings. A 2013 special edition of the journal ''Social Psychology'' focused on replication studies. Standardization of, and mandatory transparency about, the statistical and experimental methods used have been proposed. Careful documentation of the experimental set-up is considered crucial for the replicability of experiments, yet various variables, such as animals' diets in animal studies, may go undocumented and unstandardized. A 2016 article by John Ioannidis elaborated on "Why Most Clinical Research Is Not Useful". Ioannidis describes what he views as some of the problems and calls for reform, characterizing certain points for medical research to be useful again; one example he gives is the need for medicine to be patient-centered (e.g. in the form of the Patient-Centered Outcomes Research Institute) instead of the current practice of mainly taking care of "the needs of physicians, investigators, or sponsors".


Reform in scientific publishing


Metascience

Metascience is the use of scientific methodology to study science itself. It seeks to increase the quality of scientific research while reducing waste. It is also known as "research on research" and "the science of science", as it uses research methods to study how research is done and where improvements can be made. Metascience is concerned with all fields of research and has been called "a bird's eye view of science." In Ioannidis's words, "Science is the best thing that has happened to human beings ... but we can do it better." Meta-research continues to be conducted to identify the roots of the crisis and to address them. Methods of addressing the crisis include pre-registration of scientific studies and clinical trials as well as the founding of organizations such as CONSORT and the EQUATOR Network that issue guidelines for methodology and reporting. Efforts continue to reform the system of academic incentives, improve the peer review process, reduce the misuse of statistics, combat bias in scientific literature, and increase the overall quality and efficiency of the scientific process.


Presentation of methodology

Some authors have argued that the insufficient communication of experimental methods is a major contributor to the reproducibility crisis and that better reporting of experimental design and statistical analyses would improve the situation. These authors tend to plead for both a broad cultural change in the scientific community of how statistics are considered and a more coercive push from scientific journals and funding bodies. But concerns have been raised about the potential for standards for transparency and replication to be misapplied to qualitative as well as quantitative studies. Business and management journals that have introduced editorial policies on data accessibility, replication, and transparency include the ''
Strategic Management Journal'', the ''Journal of International Business Studies'', and the ''Management and Organization Review''.


Result-blind peer review

In response to concerns in psychology about publication bias and data dredging, more than 140 psychology journals have adopted result-blind peer review. In this approach, studies are accepted before they are conducted, on the basis of the methodological rigor of their experimental designs and the theoretical justification for their statistical analysis techniques, rather than after completion and on the basis of their findings. Early analysis of this procedure has estimated that 61% of result-blind studies have led to null results, in contrast to an estimated 5% to 20% in earlier research. In addition, large-scale collaborations between researchers working in multiple labs in different countries, which regularly make their data openly available for other researchers to assess, have become much more common in psychology.


Pre-registration of studies

Scientific publishing has begun using pre-registration reports to address the replication crisis. The registered report format requires authors to submit a description of the study methods and analyses prior to data collection. Once the method and analysis plan has been vetted through peer review, publication of the findings is provisionally guaranteed, provided the authors follow the proposed protocol. One goal of registered reports is to circumvent the publication bias toward significant findings that can lead to the use of questionable research practices. Another is to encourage publication of studies with rigorous methods. The journal ''Psychological Science'' has encouraged the preregistration of studies and the reporting of effect sizes and confidence intervals. The editor in chief also noted that the editorial staff will ask for replication of studies with surprising findings from examinations using small sample sizes before allowing the manuscripts to be published.


Metadata and digital tools for tracking replications

It has been suggested that "a simple way to check how often studies have been repeated, and whether or not the original findings are confirmed" is needed. Categorizations and ratings of reproducibility at the study or results level, as well as addition of links to and rating of third-party confirmations, could be conducted by the peer-reviewers, the scientific journal, or by readers in combination with novel digital platforms or tools.


Statistical reform


Requiring smaller ''p''-values

Many publications require a ''p''-value of ''p'' < 0.05 to claim
statistical significance
. The paper "Redefine statistical significance", signed by a large number of scientists and mathematicians, proposes that in "fields where the threshold for defining statistical significance for new discoveries is ''p'' < 0.05, we propose a change to ''p'' < 0.005. This simple step would immediately improve the reproducibility of scientific research in many fields." Their rationale is that "a leading cause of non-reproducibility (is that the) statistical standards of evidence for claiming new discoveries in many fields of science are simply too low. Associating 'statistically significant' findings with ''p'' < 0.05 results in a high rate of false positives even in the absence of other experimental, procedural and reporting problems." This call was subsequently criticised by another large group, who argued that "redefining" the threshold would not fix current problems, would lead to some new ones, and that in the end, all thresholds needed to be justified case-by-case instead of following general conventions. A 2022 followup study examined these competing recommendations' practical impact. Despite high citation rates of both proposals, researchers found limited implementation of either the p < 0.005 threshold or the case-by-case justification approach in practice. This revealed what the authors called a "vicious cycle", in which scientists reject recommendations because they are not standard practice, while the recommendations fail to become standard practice because few scientists adopt them.


Addressing misinterpretation of ''p''-values

Although statisticians are unanimous that the use of "''p'' < 0.05" as a standard for significance provides weaker evidence than is generally appreciated, there is a lack of unanimity about what should be done about it. Some have advocated that Bayesian methods should replace ''p''-values. This has not happened on a wide scale, partly because it is complicated and partly because many users distrust the specification of prior distributions in the absence of hard data. A simplified version of the Bayesian argument, based on testing a point null hypothesis, was suggested by pharmacologist David Colquhoun. The logical problems of inductive inference were discussed in "The Problem with p-values" (2016). The hazards of reliance on ''p''-values arise partly because even an observation of ''p'' = 0.001 is not necessarily strong evidence against the null hypothesis. Although the likelihood ratio in favor of the alternative hypothesis over the null is then close to 100, if the hypothesis was implausible, with a prior probability of a real effect of 0.1, even the observation of ''p'' = 0.001 would carry a false positive risk of 8 percent, and so would still fail to reach the 5 percent level. It was recommended that the terms "significant" and "non-significant" should not be used. ''p''-values and confidence intervals should still be specified, but they should be accompanied by an indication of the false-positive risk. It was suggested that the best way to do this is to calculate the prior probability that one would need to believe in order to achieve a false positive risk of a certain level, such as 5%. The calculations can be done with various computer software. This reverse Bayesian approach, which physicist Robert Matthews suggested in 2001, is one way to avoid the problem that the prior probability is rarely known.
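The arithmetic behind these figures can be sketched with Bayes' theorem in odds form. The code below is a minimal illustration, not the published software: it takes the likelihood ratio of 100 quoted above as a given input rather than deriving it from the ''p''-value, and the function names are hypothetical.

```python
def false_positive_risk(likelihood_ratio, prior):
    """Probability that a 'positive' result is a false positive.

    likelihood_ratio: evidence for the alternative over the null
    prior: prior probability that the effect is real
    """
    prior_odds = prior / (1 - prior)
    posterior_odds = likelihood_ratio * prior_odds  # Bayes' theorem, odds form
    return 1 / (1 + posterior_odds)


def prior_needed(likelihood_ratio, target_risk):
    """Reverse Bayesian step: prior required to reach a target risk."""
    posterior_odds = (1 - target_risk) / target_risk
    prior_odds = posterior_odds / likelihood_ratio
    return prior_odds / (1 + prior_odds)


# Figures from the text: likelihood ratio near 100, prior of 0.1.
risk = false_positive_risk(100, 0.1)
needed = prior_needed(100, 0.05)

print(f"false positive risk with prior 0.1: {risk:.3f}")
print(f"prior needed for a 5% risk:         {needed:.3f}")
```

With these inputs the risk comes out at roughly 8 percent, matching the figure above, and the reverse calculation indicates that a prior of roughly 0.16 would be needed to bring the risk down to 5 percent.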


Encouraging larger sample sizes

To improve the quality of replications, larger sample sizes than those used in the original study are often needed. Larger sample sizes are needed because estimates of effect sizes in published work are often exaggerated due to publication bias and the large sampling variability associated with small sample sizes in an original study. Further, using significance thresholds usually leads to inflated effects, because, particularly with small sample sizes, only the largest effects will become significant.
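The inflation caused by a significance filter can be seen in a small simulation. The sketch below is illustrative only: the true effect (0.3 standard deviations), the group sizes, the number of simulated studies and the use of a normal approximation to the test are all assumptions, and the function name is hypothetical.

```python
import random
import statistics


def simulate_estimates(true_effect, n, studies, seed=0):
    """Simulate two-group studies; return (all estimates, significant ones)."""
    rng = random.Random(seed)
    all_estimates, significant = [], []
    for _ in range(studies):
        treated = [rng.gauss(true_effect, 1.0) for _ in range(n)]
        control = [rng.gauss(0.0, 1.0) for _ in range(n)]
        diff = statistics.fmean(treated) - statistics.fmean(control)
        std_err = ((statistics.variance(treated)
                    + statistics.variance(control)) / n) ** 0.5
        all_estimates.append(diff)
        # Normal approximation to a two-sided test at the 0.05 level.
        if abs(diff / std_err) > 1.96:
            significant.append(diff)
    return all_estimates, significant


TRUE_EFFECT = 0.3  # assumed true difference, in standard deviations

all_small, sig_small = simulate_estimates(TRUE_EFFECT, n=20, studies=2000)
all_large, sig_large = simulate_estimates(TRUE_EFFECT, n=200, studies=2000)

print("true effect:                   ", TRUE_EFFECT)
print("mean of all estimates (n=20):  ", round(statistics.fmean(all_small), 2))
print("mean of significant (n=20):    ", round(statistics.fmean(sig_small), 2))
print("mean of significant (n=200):   ", round(statistics.fmean(sig_large), 2))
```

Averaged over all simulated studies the estimate is close to the true effect, but the average over only the significant small-sample studies is far larger. A replication planned around that inflated estimate would be underpowered, which is the argument for using a larger sample than the original.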


Cross-validation

One common statistical problem is overfitting, which occurs when researchers fit a regression model over a large number of variables but a small number of data points. For example, a typical fMRI study of emotion, personality, and social cognition has fewer than 100 subjects, but each subject has 10,000 voxels. The study would fit a sparse linear regression model that uses the voxels to predict a variable of interest, such as self-reported stress. But the study would then report the ''p''-value of the model ''on the same data'' it was fitted to. The standard approach in statistics, where data is split into a training and a validation set, is resisted because test subjects are expensive to acquire. One possible solution is cross-validation, which allows model validation while also allowing the whole dataset to be used for model-fitting.
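The difference between evaluating a model on its own training data and cross-validating it can be sketched on pure noise. The example below is a simplified stand-in, not an fMRI pipeline: it replaces sparse regression with "pick the single best-correlated feature and fit a line", and the data, sizes and function names are all hypothetical. Because the outcome is unrelated to every feature, any apparent predictive power is overfitting.

```python
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def best_feature(features, y, rows):
    """Index of the feature most correlated with y on the given rows."""
    ys = [y[i] for i in rows]
    return max(range(len(features)),
               key=lambda j: abs(pearson([features[j][i] for i in rows], ys)))


def fit_line(x, y):
    """Least-squares slope and intercept for one predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return slope, my - slope * mx


rng = random.Random(1)
N_SUBJECTS, N_FEATURES, K = 40, 500, 5
y = [rng.gauss(0, 1) for _ in range(N_SUBJECTS)]          # outcome: noise
features = [[rng.gauss(0, 1) for _ in range(N_SUBJECTS)]  # "voxels": noise
            for _ in range(N_FEATURES)]
rows = list(range(N_SUBJECTS))

# Naive evaluation: select and score on the same data.
chosen = best_feature(features, y, rows)
in_sample_r = abs(pearson(features[chosen], y))

# K-fold cross-validation: selection and fitting are redone in every fold,
# and each subject is predicted by a model that never saw that subject.
predictions = [0.0] * N_SUBJECTS
for fold in range(K):
    test = rows[fold::K]
    train = [i for i in rows if i not in test]
    j = best_feature(features, y, train)
    slope, intercept = fit_line([features[j][i] for i in train],
                                [y[i] for i in train])
    for i in test:
        predictions[i] = intercept + slope * features[j][i]
cv_r = pearson(predictions, y)

print("in-sample correlation:      ", round(in_sample_r, 2))
print("cross-validated correlation:", round(cv_r, 2))
```

The key design point is that feature selection happens inside each fold. Selecting the feature on the full dataset first and cross-validating only the final fit would leak information from the held-out subjects and reproduce the original problem.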


Replication efforts


Funding

In July 2016, the Netherlands Organisation for Scientific Research made €3 million available for replication studies. The funding is for replication based on reanalysis of existing data and replication by collecting and analysing new data. Funding is available in the areas of social sciences, health research and healthcare innovation. In 2013, the Laura and John Arnold Foundation funded the launch of the Center for Open Science with a $5.25 million grant. By 2017, it had provided an additional $10 million in funding. It also funded the launch of the Meta-Research Innovation Center at Stanford, run at Stanford University by Ioannidis and medical scientist Steven Goodman, to study ways to improve scientific research. It also provided funding for the AllTrials initiative, led in part by medical scientist Ben Goldacre.


Emphasis in post-secondary education

Based on coursework in experimental methods at MIT, Stanford, and the University of Washington, it has been suggested that methods courses in psychology and other fields should emphasize replication attempts rather than original studies. Such an approach would help students learn scientific methodology and provide numerous independent replications that would test the replicability of meaningful scientific findings. Some have recommended that graduate students should be required to publish a high-quality replication attempt on a topic related to their doctoral research prior to graduation.


Replication database

As the number of replication attempts has grown, there has been concern that uncoordinated efforts may lead to research waste. This has created a need to track replication attempts systematically, and several databases have been created as a result. These efforts include a Replication Database that covers psychology and speech-language therapy, among other disciplines, and that aims to promote theory-driven research and optimize the use of academic and institutional resources, while promoting trust in science.


Final year thesis

Some institutions require undergraduate students to submit a final year thesis that consists of an original piece of research. Daniel Quintana, a psychologist at the University of Oslo in Norway, has recommended that students should be encouraged to perform replication studies in thesis projects, as well as being taught about open science.


Semi-automated

Researchers demonstrated a way of semi-automated testing for reproducibility: statements about experimental results were extracted from gene expression cancer research papers (which, as of 2022, were non-semantic) and subsequently reproduced via the robot scientist "Eve". Problems of this approach include that it may not be feasible for many areas of research and that sufficient experimental data may not get extracted from some or many papers even if available.


Involving original authors

Psychologist Daniel Kahneman argued that, in psychology, the original authors should be involved in the replication effort because the published methods are often too vague. Others, such as psychologist Andrew Wilson, disagree, arguing that the original authors should write down the methods in detail. A 2012 investigation of replication rates in psychology found higher success rates when the replication study shared authors with the original study (91.7% successful replications with author overlap, compared to 64.6% without author overlap).


Big team science

The replication crisis has led to the formation and development of various large-scale, collaborative communities that pool their resources to address a single question across cultures, countries and disciplines. The focus is on replication, both to ensure that an effect generalizes beyond a specific culture and to investigate whether the effect is replicable and genuine. This allows interdisciplinary internal reviews, multiple perspectives, uniform protocols across labs, and the recruitment of larger and more diverse samples. Researchers can collaborate by coordinating data collection or by funding data collection by researchers who may not have access to the funds, allowing larger sample sizes and increasing the robustness of the conclusions.


Broader changes to scientific approach


Emphasize triangulation, not just replication

Psychologist Marcus R. Munafò and epidemiologist George Davey Smith argue, in a piece published by ''Nature'', that research should emphasize triangulation, not just replication, to protect against flawed ideas.


Complex systems paradigm

The dominant scientific and statistical model of causation is the linear model. The linear model assumes that mental variables are stable properties which are independent of each other. In other words, these variables are not expected to influence each other. Instead, the model assumes that the variables will have an independent, linear effect on observable outcomes. Social scientists Sebastian Wallot and Damian Kelty-Stephen argue that the linear model is not always appropriate. An alternative is the complex system model, which assumes that mental variables are interdependent. These variables are not assumed to be stable; rather, they interact and adapt to each specific context. Wallot and Kelty-Stephen argue that the complex system model is often more appropriate in psychology, and that using the linear model where the complex system model is more appropriate will result in failed replications.


Replication should seek to revise theories

Replication is fundamental to scientific progress because it confirms original findings. However, replication alone is not sufficient to resolve the replication crisis. Replication efforts should seek not just to support or question the original findings, but also to replace them with revised, stronger theories with greater explanatory power. This approach therefore involves pruning existing theories, comparing all the alternative theories, and making replication efforts more generative and engaged in theory-building. It is also important to assess the extent to which results generalise across geographical, historical and social contexts. This matters in several scientific fields, and especially to practitioners and policy makers who rely on analyses to guide important strategic decisions. Reproducible and replicable findings were the best predictor of generalisability beyond historical and geographical contexts, indicating that, for the social sciences, results from a certain time period and place can meaningfully inform what is universally present in individuals.


Open science

Open data, open source software and open source hardware are all critical to enabling reproducibility in the sense of validation of the original data analysis. The use of proprietary software, the lack of publication of analysis software and the lack of open data prevent the replication of studies. Unless software used in research is open source, reproducing results with different software and hardware configurations is impossible. CERN has both Open Data and CERN Analysis Preservation projects for storing data, all relevant information, and all software and tools needed to preserve an analysis at the large experiments of the LHC. Aside from all software and data, preserved analysis assets include metadata that enable understanding of the analysis workflow, related software, systematic uncertainties, statistics procedures and meaningful ways to search for the analysis, as well as references to publications and to backup material. CERN software is open source and available for use outside of particle physics, and there is some guidance provided to other fields on the broad approaches and strategies used for open science in contemporary particle physics. Online repositories where data, protocols, and findings can be stored and evaluated by the public seek to improve the integrity and reproducibility of research. Examples of such repositories include the Open Science Framework, Registry of Research Data Repositories, and Psychfiledrawer.org. Sites like Open Science Framework offer badges for using open science practices in an effort to incentivize scientists. However, there have been concerns that those who are most likely to provide their data and code for analyses are the researchers that are likely the most sophisticated. Ioannidis suggested that "the paradox may arise that the most meticulous and sophisticated and method-savvy and careful researchers may become more susceptible to criticism and reputation attacks by reanalyzers who hunt for errors, no matter how negligible these errors are".


See also

* Base rate fallacy
* Black swan theory
* Correlation does not imply causation
* Data dredging
* Decline effect
* Estimation statistics
* Exploratory data analysis
* Extension neglect
* Falsifiability
* Invalid science
* Misuse of statistics
* Naturalism
* Observer bias
* p-value
* Problem of induction
* Sampling bias
* Selection bias
* Statistical hypothesis testing
* Uniformitarianism


Further reading

* Bonett, D.G. (2021). Design and analysis of replication studies. ''Organizational Research Methods'', 24, 513–529. https://doi.org/10.1177/1094428120911088
* Book Review (November 2020, ''The American Conservative'')