In statistics, sampling bias is a bias in which a sample is collected in such a way that some members of the intended population have a lower or higher sampling probability than others. It results in a biased sample of a population (or non-human factors) in which not all individuals, or instances, were equally likely to have been selected. If this is not accounted for, results can be erroneously attributed to the phenomenon under study rather than to the method of sampling. Medical sources sometimes refer to sampling bias as ascertainment bias. Ascertainment bias has essentially the same definition, but is still sometimes classified as a separate type of bias.
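The effect of unequal sampling probabilities can be illustrated with a small simulation. The sketch below is only an illustration under assumed numbers: the population, the inclusion probabilities, and the helper `draw_sample` are hypothetical and not taken from any study.

```python
import random


def draw_sample(population, inclusion_prob, rng):
    """Include each member independently with its own sampling probability."""
    return [y for y, p in zip(population, inclusion_prob) if rng.random() < p]


rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population: half the members have value 0, half have value 1.
population = [0] * 5000 + [1] * 5000
# Assumption: members with value 1 are three times as likely to be sampled.
inclusion_prob = [0.1] * 5000 + [0.3] * 5000

sample = draw_sample(population, inclusion_prob, rng)

true_mean = sum(population) / len(population)
sample_mean = sum(sample) / len(sample)

print(f"population mean: {true_mean:.3f}")
print(f"biased sample mean: {sample_mean:.3f}")
```

Under these assumptions the sample mean settles near 0.75 rather than the population value of 0.5, and drawing a larger sample does not remove the discrepancy, because the error comes from the sampling method rather than from chance.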

Distinction from selection bias

Sampling bias is usually classified as a subtype of selection bias, sometimes specifically termed sample selection bias, but some classify it as a separate type of bias. A distinction, albeit not universally accepted, is that sampling bias undermines the external validity of a test (the ability of its results to be generalized to the entire population), while selection bias mainly addresses internal validity for differences or similarities found in the sample at hand. In this sense, errors occurring in the process of gathering the sample or cohort cause sampling bias, while errors in any process thereafter cause selection bias. However, selection bias and sampling bias are often used synonymously.

Types

* Selection from a specific real area. For example, a survey of high school students to measure teenage use of illegal drugs will be a biased sample because it does not include home-schooled students or dropouts. A sample is also biased if certain members are underrepresented or overrepresented relative to others in the population. For example, a "man on the street" interview which selects people who walk by a certain location is going to have an overrepresentation of healthy individuals, who are more likely to be out of the home than individuals with a chronic illness. This may be an extreme form of biased sampling, because certain members of the population are totally excluded from the sample (that is, they have zero probability of being selected).
* Self-selection bias (see also Non-response bias), which is possible whenever the group of people being studied has any form of control over whether to participate (as current standards of human-subject research ethics require for many real-time and some longitudinal forms of study). Participants' decision to participate may be correlated with traits that affect the study, making the participants a non-representative sample. For example, people who have strong opinions or substantial knowledge may be more willing to spend time answering a survey than those who do not. Another example is online and phone-in polls, which are biased samples because the respondents are self-selected. Individuals who are highly motivated to respond, typically those who have strong opinions, are overrepresented, and individuals who are indifferent or apathetic are less likely to respond. This often leads to a polarization of responses, with extreme perspectives being given a disproportionate weight in the summary. As a result, these types of polls are regarded as unscientific.
* Exclusion bias results from exclusion of particular groups from the sample, e.g. exclusion of subjects who have recently migrated into the study area (this may occur when newcomers are not available in a register used to identify the source population). Excluding subjects who move out of the study area during follow-up is roughly equivalent to dropout or nonresponse, a selection bias in that it rather affects the internal validity of the study.
* Healthy user bias, when the study population is likely healthier than the general population. For example, someone in poor health is unlikely to have a job as a manual laborer.
* Berkson's fallacy, when the study population is selected from a hospital and so is less healthy than the general population. This can result in a spurious negative correlation between diseases: a hospital patient without diabetes is ''more'' likely to have another given disease such as cholecystitis, since they must have had some reason to enter the hospital in the first place.
* Overmatching, matching for an apparent confounder that actually is a result of the exposure. The control group becomes more similar to the cases in regard to exposure than does the general population.
* Survivorship bias, in which only "surviving" subjects are selected, ignoring those that fell out of view. For example, using the record of current companies as an indicator of business climate or economy ignores the businesses that failed and no longer exist.
* Malmquist bias, an effect in observational astronomy which leads to the preferential detection of intrinsically bright objects.
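Berkson's fallacy in particular lends itself to a short worked calculation. The sketch below is a toy model, not an epidemiological result: the prevalences, the admission rule, and the function names are all assumptions chosen for illustration. The two diseases are taken to be independent in the general population, so the prevalence of diabetes cancels out of the conditional rates and does not appear in the code.

```python
from fractions import Fraction

# Hypothetical numbers, chosen only to make the arithmetic easy to follow.
P_CHOLECYSTITIS = Fraction(1, 20)  # prevalence in the general population
P_OTHER_REASON = Fraction(1, 10)   # chance of admission with neither disease


def p_admitted(diabetes, cholecystitis):
    """Assumed admission rule: either disease guarantees a hospital stay."""
    return Fraction(1) if (diabetes or cholecystitis) else P_OTHER_REASON


def cholecystitis_rate(diabetes, hospital_only):
    """P(cholecystitis | diabetes status), optionally among hospital patients only."""
    weight = {}
    for chole in (True, False):
        prob = P_CHOLECYSTITIS if chole else 1 - P_CHOLECYSTITIS
        if hospital_only:
            prob *= p_admitted(diabetes, chole)
        weight[chole] = prob
    return weight[True] / (weight[True] + weight[False])


for hospital_only in (False, True):
    where = "hospital patients" if hospital_only else "general population"
    for diabetes in (True, False):
        status = "with diabetes" if diabetes else "without diabetes"
        print(where, status, cholecystitis_rate(diabetes, hospital_only))
```

In this toy model the cholecystitis rate is 1/20 for everyone in the general population, but among hospital patients it is 1/20 for those with diabetes and 10/29 for those without, which is the spurious negative association described in the list above.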

Symptom-based sampling

The study of medical conditions begins with anecdotal reports. By their nature, such reports only include those referred for diagnosis and treatment. A child who can't function in school is more likely to be diagnosed with dyslexia than a child who struggles but passes. A child examined for one condition is more likely to be tested for and diagnosed with other conditions, skewing comorbidity statistics. As certain diagnoses become associated with behavior problems or intellectual disability, parents try to prevent their children from being stigmatized with those diagnoses, introducing further bias. Studies carefully selected from whole populations are showing that many conditions are much more common and usually much milder than formerly believed.

Truncate selection in pedigree studies

Geneticists are limited in how they can obtain data from human populations. As an example, consider a human characteristic. We are interested in deciding if the characteristic is inherited as a simple Mendelian trait. Following the laws of Mendelian inheritance, if the parents in a family do not have the characteristic, but carry the allele for it, they are carriers (e.g. a non-expressive heterozygote). In this case their children will each have a 25% chance of showing the characteristic. The problem arises because we can't tell which families have both parents as carriers (heterozygous) unless they have a child who exhibits the characteristic. The description follows the textbook by Sutton.

The figure shows the pedigrees of all the possible families with two children when the parents are carriers (Aa).

* Nontruncate selection. In a perfect world we should be able to discover all such families with a gene, including those who are simply carriers. In this situation the analysis would be free from ascertainment bias and the pedigrees would be under "nontruncate selection". In practice, most studies identify, and include, families in a study based upon them having affected individuals.
* Truncate selection. When afflicted ''individuals'' have an equal chance of being included in a study this is called truncate selection, signifying the inadvertent exclusion (truncation) of families who are carriers for a gene. Because selection is performed on the individual level, families with two or more affected children would have a higher probability of becoming included in the study.
* Complete truncate selection is a special case where each ''family'' with an affected child has an equal chance of being selected for the study.

The probabilities of each of the families being selected are given in the figure, with the sample frequency of affected children also given. In this simple case, the researcher will look for a frequency of 4/7 or 5/8 for the characteristic, depending on the type of truncate selection used.
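The sample frequencies for the two-child case can be worked out directly from the 25% figure. The sketch below is an illustrative calculation, not the textbook's own method: the function names are hypothetical, and truncate selection is modelled under the assumption that a family's chance of inclusion is proportional to its number of affected children.

```python
from fractions import Fraction
from math import comb


def family_distribution(n_children, p_affected):
    """P(k affected children), k = 0..n, with each child affected independently."""
    q = 1 - p_affected
    return {
        k: comb(n_children, k) * p_affected**k * q ** (n_children - k)
        for k in range(n_children + 1)
    }


def observed_frequency(n_children, p_affected, family_weight):
    """Expected share of affected children among the families that get sampled.

    family_weight(k) is the relative chance that a family with k affected
    children is included in the study.
    """
    dist = family_distribution(n_children, p_affected)
    affected = sum(family_weight(k) * prob * k for k, prob in dist.items())
    children = sum(family_weight(k) * prob * n_children for k, prob in dist.items())
    return affected / children


p = Fraction(1, 4)  # carrier x carrier: each child has a 25% chance

nontruncate = observed_frequency(2, p, lambda k: 1)
complete_truncate = observed_frequency(2, p, lambda k: 1 if k >= 1 else 0)
truncate = observed_frequency(2, p, lambda k: k)  # assumed proportional to k

print(nontruncate, complete_truncate, truncate)  # 1/4 4/7 5/8
```

Only nontruncate selection recovers the Mendelian 1/4. Under these assumptions, complete truncate selection gives 4/7 and truncate selection gives 5/8, so the observed frequency has to be compared with the value expected under the ascertainment scheme that was actually used.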

The caveman effect

An example of selection bias is called the "caveman effect". Much of our understanding of prehistoric peoples comes from caves, such as cave paintings made nearly 40,000 years ago. If there had been contemporary paintings on trees, animal skins or hillsides, they would have been washed away long ago. Similarly, evidence of fire pits, middens, burial sites, etc. is most likely to remain intact to the modern era in caves. Prehistoric people are associated with caves because that is where the data still exists, not necessarily because most of them lived in caves for most of their lives.

Problems due to sampling bias

Sampling bias is problematic because it is possible that a statistic computed from the sample is systematically erroneous. Sampling bias can lead to a systematic over- or under-estimation of the corresponding parameter in the population. Sampling bias occurs in practice as it is practically impossible to ensure perfect randomness in sampling. If the degree of misrepresentation is small, then the sample can be treated as a reasonable approximation to a random sample. Also, if the sample does not differ markedly in the quantity being measured, then a biased sample can still be a reasonable estimate.

The word bias has a strong negative connotation. Indeed, biases sometimes come from deliberate intent to mislead or other scientific fraud. In statistical usage, bias merely represents a mathematical property, no matter if it is deliberate or unconscious or due to imperfections in the instruments used for observation. While some individuals might deliberately use a biased sample to produce misleading results, more often a biased sample is just a reflection of the difficulty in obtaining a truly representative sample, or of ignorance of the bias in the process of measurement or analysis.

An example of how ignorance of a bias can exist is in the widespread use of a ratio (a.k.a. fold change) as a measure of difference in biology. Because it is easier to achieve a large ratio with two small numbers with a given difference, and relatively more difficult to achieve a large ratio with two large numbers with a larger difference, large significant differences may be missed when comparing relatively large numeric measurements. Some have called this a 'demarcation bias' because the use of a ratio (division) instead of a difference (subtraction) removes the results of the analysis from science into pseudoscience (see Demarcation Problem).

Some samples use a biased statistical design which nevertheless allows the estimation of parameters. The U.S. National Center for Health Statistics, for example, deliberately oversamples from minority populations in many of its nationwide surveys in order to gain sufficient precision for estimates within these groups. These surveys require the use of sample weights (see below) to produce proper estimates across all ethnic groups. Provided that certain conditions are met (chiefly that the weights are calculated and used correctly), these samples permit accurate estimation of population parameters.
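The point about ratios made earlier in this section can be checked with simple arithmetic. The numbers and the cutoff below are hypothetical and serve only to show how a ratio and a difference can rank the same two comparisons in opposite order.

```python
def fold_change(a, b):
    """Ratio of the second measurement to the first."""
    return b / a


def difference(a, b):
    """Absolute change from the first measurement to the second."""
    return b - a


# Hypothetical measurement pairs (before, after).
pairs = {
    "small values": (2, 12),
    "large values": (1000, 1500),
}

FOLD_THRESHOLD = 2.0  # assumed cutoff, for illustration only

for label, (a, b) in pairs.items():
    ratio = fold_change(a, b)
    flagged = ratio >= FOLD_THRESHOLD
    print(f"{label}: fold change {ratio}, difference {difference(a, b)}, flagged={flagged}")
```

With these numbers the small pair has a fold change of 6 from a difference of only 10, while the large pair has a fold change of 1.5 from a difference of 500, so a filter based on the ratio alone keeps the first comparison and drops the second.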

Historical examples

A classic example of a biased sample and the misleading results it produced occurred in 1936. In the early days of opinion polling, the American ''Literary Digest'' magazine collected over two million postal surveys and predicted that the Republican candidate in the U.S. presidential election, Alf Landon, would beat the incumbent president, Franklin Roosevelt, by a large margin. The result was the exact opposite. The Literary Digest survey represented a sample collected from readers of the magazine, supplemented by records of registered automobile owners and telephone users. This sample included an over-representation of wealthy individuals, who, as a group, were more likely to vote for the Republican candidate. In contrast, a poll of only 50 thousand citizens selected by George Gallup's organization successfully predicted the result, leading to the popularity of the Gallup poll.

Another classic example occurred in the 1948 presidential election. On election night, the Chicago Tribune printed the headline ''DEWEY DEFEATS TRUMAN'', which turned out to be mistaken. In the morning the grinning president-elect, Harry S. Truman, was photographed holding a newspaper bearing this headline. The reason the Tribune was mistaken is that their editor trusted the results of a phone survey. Survey research was then in its infancy, and few academics realized that a sample of telephone users was not representative of the general population. Telephones were not yet widespread, and those who had them tended to be prosperous and have stable addresses. (In many cities, the Bell System telephone directory contained the same names as the Social Register.) In addition, the Gallup poll that the Tribune based its headline on was over two weeks old at the time of the printing.

In air quality data, pollutants (such as carbon monoxide, nitrogen monoxide, nitrogen dioxide, or ozone) frequently show high correlations, as they stem from the same chemical process(es). These correlations depend on space (i.e., location) and time (i.e., period). Therefore, a pollutant distribution is not necessarily representative for every location and every period. If a low-cost measurement instrument is calibrated with field data in a multivariate manner, more precisely by collocation next to a reference instrument, the relationships between the different compounds are incorporated into the calibration model. Relocating the instrument can then produce erroneous results, because the relationships learned at the calibration site may not hold at the new location.

A twenty-first century example is the COVID-19 pandemic, where variations in sampling bias in COVID-19 testing have been shown to account for wide variations in both case fatality rates and the age distribution of cases across countries.

Statistical corrections for a biased sample

If entire segments of the population are excluded from a sample, then there are no adjustments that can produce estimates that are representative of the entire population. But if some groups are underrepresented and the degree of underrepresentation can be quantified, then sample weights can correct the bias. However, the success of the correction is limited to the selection model chosen. If certain variables are missing, the methods used to correct the bias could be inaccurate.

For example, a hypothetical population might include 10 million men and 10 million women. Suppose that a biased sample of 100 patients included 20 men and 80 women. A researcher could correct for this imbalance by attaching a weight of 2.5 to each male and 0.625 to each female. This would adjust any estimates to achieve the same expected value as a sample that included exactly 50 men and 50 women, unless men and women differed in their likelihood of taking part in the survey.
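The weights in this example follow from dividing each group's population share by its sample share. The sketch below reproduces that arithmetic; the function name and the outcome counts (10 of the 20 men and 20 of the 80 women having some condition) are hypothetical additions used only to show the weights being applied.

```python
from fractions import Fraction


def sample_weights(population_counts, sample_counts):
    """Weight per group = population share divided by sample share."""
    pop_total = sum(population_counts.values())
    sample_total = sum(sample_counts.values())
    return {
        group: Fraction(population_counts[group], pop_total)
        / Fraction(sample_counts[group], sample_total)
        for group in sample_counts
    }


population_counts = {"men": 10_000_000, "women": 10_000_000}
sample_counts = {"men": 20, "women": 80}

weights = sample_weights(population_counts, sample_counts)
print({g: float(w) for g, w in weights.items()})  # men 2.5, women 0.625

# Weighted group sizes match a balanced sample of 50 men and 50 women.
weighted_sizes = {g: weights[g] * sample_counts[g] for g in sample_counts}

# Hypothetical outcome counts, not from the source.
cases = {"men": 10, "women": 20}

unweighted_rate = Fraction(sum(cases.values()), sum(sample_counts.values()))
weighted_rate = sum(weights[g] * cases[g] for g in cases) / sum(weighted_sizes.values())

print(float(unweighted_rate), float(weighted_rate))  # 0.3 0.375
```

With these assumed counts, the unweighted estimate of 0.3 is pulled toward the women's rate of 0.25 because women make up 80% of the sample, while the weighted estimate of 0.375 is the simple average of the two group rates, as expected for a population that is half men and half women.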
