Analytical Bias

In the field of statistics, bias is a systematic tendency in which the methods used to gather data and estimate a sample statistic present an inaccurate, skewed or distorted (''biased'') depiction of reality. Statistical bias exists in numerous stages of the data collection and analysis process, including: the source of the data, the methods used to collect the data, the estimator chosen, and the methods used to analyze the data. Data analysts can take various measures at each stage of the process to reduce the impact of statistical bias in their work. Understanding the source of statistical bias can help to assess whether the observed results are close to actuality. Issues of statistical bias have been argued to be closely linked to issues of statistical validity.

Statistical bias can have significant real-world implications, as data is used to inform decision making across a wide variety of processes in society. Data is used to inform lawmaking, industry regulation, corporate marketing and distribution tactics, and institutional policies in organizations and workplaces. Therefore, there can be significant implications if statistical bias is not accounted for and controlled. For example, if a pharmaceutical company wishes to explore the effect of a medication on the common cold but the data sample only includes men, any conclusions made from that data will be biased towards how the medication affects men rather than people in general. That means the information would be incomplete and not useful for deciding whether the medication is ready for release to the general public. In this scenario, the bias can be addressed by broadening the sample. This sampling error is only one of the ways in which data can be biased.

Bias can be differentiated from other statistical mistakes such as inaccuracy (instrument failure or inadequacy), lack of data, or mistakes in transcription (typos). Bias implies that the data selection may have been skewed by the collection criteria. Other forms of human-based bias emerge in data collection as well, such as response bias, in which participants give inaccurate responses to a question. Bias does not preclude the existence of any other mistakes: one may have a poorly designed sample, an inaccurate measurement device, and typos in recording data simultaneously. Ideally, all factors are controlled and accounted for. It is also useful to recognize that the term "error" specifically refers to the outcome rather than the process (errors of rejection or acceptance of the hypothesis being tested), or to the phenomenon of random errors. The terms ''flaw'' or ''mistake'' are recommended to differentiate procedural errors from these specifically defined outcome-based terms.


Bias of an estimator

Statistical bias is a feature of a statistical technique or of its results whereby the expected value of the results differs from the true underlying quantitative parameter being estimated. The bias of an estimator of a parameter should not be confused with its degree of precision, as the degree of precision is a measure of the sampling error.

The bias is defined as follows: let T be a statistic used to estimate a parameter \theta, and let \operatorname{E}(T) denote the expected value of T. Then

:\operatorname{bias}(T, \theta) = \operatorname{bias}(T) = \operatorname{E}(T) - \theta

is called the bias of the statistic T (with respect to \theta). If \operatorname{bias}(T, \theta) = 0, then T is said to be an ''unbiased estimator'' of \theta; otherwise, it is said to be a ''biased estimator'' of \theta. The bias of a statistic T is always relative to the parameter \theta it is used to estimate, but the parameter \theta is often omitted when it is clear from the context what is being estimated.
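As a concrete illustration of this definition, the following is a minimal simulation sketch (not taken from the source; the Gaussian population, sample size and trial count are invented for illustration). It estimates the bias of the maximum-likelihood variance estimator, which divides by n, and of the usual sample variance, which divides by n - 1:

<syntaxhighlight lang="python">
import numpy as np

# Minimal Monte Carlo sketch (illustrative values): compare two variance estimators.
# The maximum-likelihood estimator divides by n (ddof=0); the sample variance divides by n - 1 (ddof=1).
rng = np.random.default_rng(0)
true_var = 4.0            # variance of the simulated Gaussian population (sigma = 2)
n, trials = 10, 100_000   # small samples make the bias easy to see

ml_estimates = np.empty(trials)
unbiased_estimates = np.empty(trials)
for i in range(trials):
    sample = rng.normal(loc=0.0, scale=2.0, size=n)
    ml_estimates[i] = sample.var(ddof=0)        # divide by n
    unbiased_estimates[i] = sample.var(ddof=1)  # divide by n - 1

# bias(T, theta) is approximated by mean(T) - theta over many simulated samples
print("bias of the ddof=0 estimator:", ml_estimates.mean() - true_var)        # close to -sigma^2 / n = -0.4
print("bias of the ddof=1 estimator:", unbiased_estimates.mean() - true_var)  # close to 0
</syntaxhighlight>

With n = 10 and a true variance of 4, the estimated bias of the ddof=0 estimator comes out near the theoretical value of -\sigma^2/n = -0.4, while the ddof=1 estimator's bias is close to zero.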


Types

Statistical bias can arise at every stage of data analysis. The sources of bias below are listed by the stage at which they occur.


Data selection

Selection bias involves individuals being more likely to be selected for study than others, biasing the sample. This can also be termed selection effect, sampling bias and ''Berksonian bias''; a simulation sketch illustrating sampling bias follows the list below.
* Spectrum bias arises from evaluating diagnostic tests on biased patient samples, leading to an overestimate of the sensitivity and specificity of the test. For example, a high prevalence of disease in a study population increases positive predictive values, which will cause a bias between the predicted values and the real ones.
* Observer selection bias occurs when the evidence presented has been pre-filtered by observers, the so-called anthropic principle. The data collected is filtered not only by the design of the experiment, but also by the necessary precondition that there must be someone doing a study. An example is impact events on the Earth in the past: an impact event may have caused the extinction of intelligent animals, or there may have been no intelligent animals at that time. Therefore, some impact events have not been observed, even though they may have occurred in the past.
* Volunteer bias occurs when volunteers have intrinsically different characteristics from the target population of the study. Research has shown that volunteers tend to come from families with higher socioeconomic status. Furthermore, another study shows that women are more likely to volunteer for studies than men.
* Funding bias may lead to the selection of outcomes, test samples, or test procedures that favor a study's financial sponsor.
* Attrition bias arises due to a loss of participants, e.g., loss to follow-up during a study.
* Recall bias arises due to differences in the accuracy or completeness of participant recollections of past events; for example, patients cannot recall exactly how many cigarettes they smoked last week, leading to over-estimation or under-estimation.
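The following is a minimal sketch of sampling bias (illustrative only; the two subgroups, their means and the response weights are invented): when one subgroup is far more likely to end up in the sample, the sample mean systematically misses the population mean, no matter how many responses are collected.

<syntaxhighlight lang="python">
import numpy as np

# Hypothetical population: two subgroups with different means for some outcome.
rng = np.random.default_rng(1)
group_a = rng.normal(loc=50.0, scale=5.0, size=80_000)  # 80% of the population
group_b = rng.normal(loc=60.0, scale=5.0, size=20_000)  # 20% of the population
population = np.concatenate([group_a, group_b])
print("population mean:", population.mean())             # about 52

# Biased sampling: group B members are ten times as likely to end up in the sample.
weights = np.concatenate([np.full(group_a.size, 1.0), np.full(group_b.size, 10.0)])
weights /= weights.sum()
biased_sample = rng.choice(population, size=2_000, p=weights)
print("biased sample mean:", biased_sample.mean())        # about 57, systematically too high

# A simple random sample leaves only random (not systematic) error.
random_sample = rng.choice(population, size=2_000)
print("random sample mean:", random_sample.mean())         # about 52
</syntaxhighlight>

A simple random sample of the same size has no such systematic error; only random sampling error, which shrinks as the sample grows, remains.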


Hypothesis testing

Type I and type II errors in statistical hypothesis testing lead to wrong results. A Type I error happens when the null hypothesis is correct but is rejected. For instance, suppose that the null hypothesis is that the average driving speed lies between 75 and 85 km/h: within that range the driver is not considered to be speeding, while outside it the driver is considered to be speeding. If a driver whose average speed lies within that range (say, 77 km/h) nonetheless receives a ticket, the decision maker has committed a Type I error; in other words, the average driving speed satisfies the null hypothesis, but the null hypothesis is rejected. Conversely, a Type II error happens when the null hypothesis is not correct but is accepted.

Bias in hypothesis testing occurs when the power (the complement of the Type II error rate) at some alternative is lower than the supremum of the Type I error rate, which is usually the significance level \alpha. Equivalently, a test is said to be unbiased if its rejection rate at every alternative is at least as high as its rejection rate at every point in the null hypothesis set.
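As a rough illustration, the rejection rates of a two-sided one-sample t-test can be estimated by simulation (a sketch, not from the source; the normal population, sample size and effect size are invented): under the null hypothesis the rejection rate should sit near \alpha, and for an unbiased test it should be at least \alpha at every alternative.

<syntaxhighlight lang="python">
import numpy as np
from scipy import stats

# Simulation sketch (illustrative values): rejection rates of a two-sided one-sample t-test.
rng = np.random.default_rng(2)
alpha, n, trials = 0.05, 30, 20_000

def rejection_rate(true_mean: float) -> float:
    """Fraction of simulated samples for which H0: mean = 0 is rejected at level alpha."""
    rejections = 0
    for _ in range(trials):
        sample = rng.normal(loc=true_mean, scale=1.0, size=n)
        _, p_value = stats.ttest_1samp(sample, popmean=0.0)
        if p_value < alpha:
            rejections += 1
    return rejections / trials

print("Type I error rate (true mean = 0):", rejection_rate(0.0))  # close to alpha = 0.05
print("power at true mean = 0.5:", rejection_rate(0.5))           # well above alpha
# For an unbiased test, the rejection rate at every alternative is at least alpha.
</syntaxhighlight>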


Estimator selection

The bias of an estimator is the difference between the estimator's expected value and the true value of the parameter being estimated. Although an unbiased estimator is theoretically preferable to a biased estimator, in practice biased estimators with small biases are frequently used. A biased estimator may be more useful for several reasons. First, an unbiased estimator may not exist without further assumptions. Second, an unbiased estimator is sometimes hard to compute. Third, a biased estimator may have a lower mean squared error.
* A biased estimator can be better than any unbiased estimator arising from the Poisson distribution: the biased estimator's value is always positive and its mean squared error is smaller than that of the unbiased one, which makes the biased estimator more accurate.
* Omitted-variable bias is the bias that appears in estimates of parameters in regression analysis when the assumed specification omits an independent variable that should be in the model; a simulation sketch follows this list.
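Below is a minimal sketch of omitted-variable bias (illustrative only; the data-generating coefficients and the correlation between the regressors are invented): when a regressor z that is correlated with x is dropped from the model, the ordinary-least-squares estimate of x's coefficient absorbs part of z's effect.

<syntaxhighlight lang="python">
import numpy as np

# Sketch of omitted-variable bias with made-up coefficients:
# true model  y = 1.0 * x + 2.0 * z + noise,  where z is correlated with x.
rng = np.random.default_rng(3)
n = 50_000
x = rng.normal(size=n)
z = 0.8 * x + rng.normal(size=n)            # z correlated with x
y = 1.0 * x + 2.0 * z + rng.normal(size=n)

def ols(design: np.ndarray, response: np.ndarray) -> np.ndarray:
    """Ordinary least squares coefficients via numpy's least-squares solver."""
    coefficients, *_ = np.linalg.lstsq(design, response, rcond=None)
    return coefficients

full_design = np.column_stack([np.ones(n), x, z])
short_design = np.column_stack([np.ones(n), x])   # z omitted from the specification

print("coefficient on x, full model:", ols(full_design, y)[1])   # close to the true value 1.0
print("coefficient on x, z omitted:", ols(short_design, y)[1])   # close to 1.0 + 2.0 * 0.8 = 2.6
</syntaxhighlight>

With these made-up values the short regression's coefficient on x converges to 1.0 + 2.0 × 0.8 = 2.6 rather than the true 1.0, matching the textbook omitted-variable-bias formula: the omitted coefficient times the regression of the omitted variable on the included one.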


Analysis methods

* Detection bias occurs when a phenomenon is more likely to be observed in a particular set of study subjects. For instance, the syndemic involving obesity and diabetes may mean doctors are more likely to look for diabetes in obese patients than in thinner patients, leading to an inflation of diagnosed diabetes among obese patients because of skewed detection efforts.
* In educational measurement, bias is defined as "Systematic errors in test content, test administration, and/or scoring procedures that can cause some test takers to get either lower or higher scores than their true ability would merit." The source of the bias is irrelevant to the trait the test is intended to measure.
* Observer bias arises when the researcher subconsciously influences the experiment due to cognitive bias, where judgment may alter how an experiment is carried out or how results are recorded.


Interpretation

Reporting bias involves a skew in the availability of data, such that observations of a certain kind are more likely to be reported.


Addressing statistical bias

Depending on the type of bias present, researchers and analysts can take different steps to reduce bias in a data set. All types of bias mentioned above have corresponding measures which can be taken to reduce or eliminate their impact. Bias should be accounted for at every step of the data collection process, beginning with clearly defined research parameters and consideration of the team who will be conducting the research. Observer bias may be reduced by implementing a blind or double-blind technique. Avoidance of p-hacking is essential to accurate data collection and analysis. One way to check for bias in results after the fact is to rerun analyses with different independent variables and observe whether a given phenomenon still occurs in the dependent variables. Careful use of language in reporting can reduce misleading phrases, such as describing a result as "approaching" statistical significance when it did not actually achieve it.
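To make the p-hacking point concrete, here is a minimal simulation sketch (illustrative only; the number of candidate predictors, sample size and trial count are invented): when many unrelated predictors are tested against the same outcome and only the smallest p-value is reported, the false-positive rate is far higher than the nominal significance level.

<syntaxhighlight lang="python">
import numpy as np
from scipy import stats

# Sketch of p-hacking (illustrative values): all 20 candidate predictors are pure noise,
# yet reporting only the smallest p-value "finds" an effect far more often than 5% of the time.
rng = np.random.default_rng(4)
alpha, n, n_predictors, trials = 0.05, 100, 20, 5_000

false_positives = 0
for _ in range(trials):
    outcome = rng.normal(size=n)
    predictors = rng.normal(size=(n_predictors, n))            # unrelated to the outcome
    p_values = [stats.pearsonr(pred, outcome)[1] for pred in predictors]
    if min(p_values) < alpha:                                  # cherry-pick the "best" predictor
        false_positives += 1

print("nominal false-positive rate:", alpha)                   # 0.05
print("rate after cherry-picking:", false_positives / trials)  # roughly 1 - 0.95**20, about 0.64
</syntaxhighlight>

Pre-registering the single hypothesis of interest, or applying a multiple-comparisons correction before declaring significance, keeps the overall false-positive rate near the nominal level.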


See also

* Trueness
* Systematic error


References


External links


The Catalogue of Bias is a project at the Centre for Evidence-Based Medicine cataloguing biases that affect health evidence.