
In pattern recognition, information retrieval, object detection and classification (machine learning), precision and recall are performance metrics that apply to data retrieved from a collection, corpus or sample space.
Precision (also called positive predictive value) is the fraction of relevant instances among the retrieved instances. Written as a formula:
:<math>\text{Precision} = \frac{\text{Relevant retrieved instances}}{\text{All retrieved instances}}</math>
Recall (also known as sensitivity) is the fraction of relevant instances that were retrieved. Written as a formula:
:<math>\text{Recall} = \frac{\text{Relevant retrieved instances}}{\text{All relevant instances}}</math>
Both precision and recall are therefore based on relevance.
Consider a computer program for recognizing dogs (the relevant element) in a digital photograph. Upon processing a picture which contains ten cats and twelve dogs, the program identifies eight dogs. Of the eight elements identified as dogs, only five actually are dogs (true positives), while the other three are cats (false positives). Seven dogs were missed (false negatives), and seven cats were correctly excluded (true negatives). The program's precision is then 5/8 (true positives / selected elements) while its recall is 5/12 (true positives / relevant elements).
Adopting a hypothesis-testing approach, where in this case the null hypothesis is that a given item is ''irrelevant'' (not a dog), absence of type I and type II errors (perfect specificity and sensitivity) corresponds respectively to perfect precision (no false positives) and perfect recall (no false negatives).
More generally, recall is simply the complement of the type II error rate (i.e., one minus the type II error rate). Precision is related to the type I error rate, but in a slightly more complicated way, as it also depends upon the prior distribution of seeing a relevant vs. an irrelevant item.
The above cat and dog example contained 8 − 5 = 3 type I errors (false positives) out of 10 total cats (actual negatives), for a type I error rate of 3/10, and 12 − 5 = 7 type II errors (false negatives) out of 12 total dogs (actual positives), for a type II error rate of 7/12. Precision can be seen as a measure of quality, and recall as a measure of quantity.
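The arithmetic above can be checked with a short Python sketch (the counts are taken from the example; the variable names are purely illustrative):
<syntaxhighlight lang="python">
# Counts from the dog/cat example above.
tp = 5   # dogs correctly identified as dogs
fp = 3   # cats incorrectly identified as dogs
fn = 7   # dogs that were missed
tn = 7   # cats correctly excluded

precision = tp / (tp + fp)            # 5/8 = 0.625
recall = tp / (tp + fn)               # 5/12 ≈ 0.417
type_i_error_rate = fp / (fp + tn)    # 3/10 = 0.3
type_ii_error_rate = fn / (fn + tp)   # 7/12 ≈ 0.583

print(precision, recall, type_i_error_rate, type_ii_error_rate)
</syntaxhighlight>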
Higher precision means that an algorithm returns more relevant results than irrelevant ones, and high recall means that an algorithm returns most of the relevant results (whether or not irrelevant ones are also returned).
Introduction

In a classification task, the precision for a class is the ''number of true positives'' (i.e. the number of items correctly labelled as belonging to the positive class) ''divided by the total number of elements labelled as belonging to the positive class'' (i.e. the sum of true positives and false positives, which are items incorrectly labelled as belonging to the class). Recall in this context is defined as the ''number of true positives divided by the total number of elements that actually belong to the positive class'' (i.e. the sum of true positives and false negatives, which are items which were not labelled as belonging to the positive class but should have been).
Precision and recall are not particularly useful metrics when used in isolation. For instance, it is possible to have perfect recall by simply retrieving every single item. Likewise, it is possible to achieve perfect precision by selecting only a very small number of extremely likely items.
In a classification task, a precision score of 1.0 for a class C means that every item labelled as belonging to class C does indeed belong to class C (but says nothing about the number of items from class C that were not labelled correctly) whereas a recall of 1.0 means that every item from class C was labelled as belonging to class C (but says nothing about how many items from other classes were incorrectly also labelled as belonging to class C).
Often, there is an inverse relationship between precision and recall, where it is possible to increase one at the cost of reducing the other, but context may dictate if one is more valued in a given situation:
A smoke detector is generally designed to commit many Type I errors (to alert in many situations when there is no danger), because the cost of a Type II error (failing to sound an alarm during a major fire) is prohibitively high. As such, smoke detectors are designed with recall in mind (to catch all real danger), even while giving little weight to the losses in precision (and making many false alarms). In the other direction,
Blackstone's ratio, "It is better that ten guilty persons escape than that one innocent suffer," emphasizes the costs of a Type I error (convicting an innocent person). As such, the criminal justice system is geared toward precision (not convicting innocents), even at the cost of losses in recall (letting more guilty people go free).
A brain surgeon removing a cancerous tumor from a patient's brain illustrates the tradeoffs as well: The surgeon needs to remove all of the tumor cells since any remaining cancer cells will regenerate the tumor. Conversely, the surgeon must not remove healthy brain cells since that would leave the patient with impaired brain function. The surgeon may be more liberal in the area of the brain they remove to ensure they have extracted all the cancer cells. This decision increases recall but reduces precision. On the other hand, the surgeon may be more conservative in the brain cells they remove to ensure they extract only cancer cells. This decision increases precision but reduces recall. That is to say, greater recall increases the chances of removing healthy cells (negative outcome) and increases the chances of removing all cancer cells (positive outcome). Greater precision decreases the chances of removing healthy cells (positive outcome) but also decreases the chances of removing all cancer cells (negative outcome).
Usually, precision and recall scores are not discussed in isolation. A ''precision-recall curve'' plots precision as a function of recall; usually precision will decrease as the recall increases. Alternatively, values for one measure can be compared for a fixed level of the other measure (e.g. ''precision at a recall level of 0.75''), or both are combined into a single measure. Examples of measures that are a combination of precision and recall are the F-measure (the weighted harmonic mean of precision and recall), or the Matthews correlation coefficient, which is a geometric mean of the chance-corrected variants: the regression coefficients Informedness (DeltaP') and Markedness (DeltaP).
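As an illustrative sketch (the confusion-matrix counts below are invented for the demonstration and do not come from the article), the relationship between the Matthews correlation coefficient and the geometric mean of Informedness and Markedness can be verified numerically:
<syntaxhighlight lang="python">
from math import sqrt

# Hypothetical confusion-matrix counts (illustrative only).
tp, fp, fn, tn = 40, 10, 20, 30

informedness = tp / (tp + fn) + tn / (tn + fp) - 1   # DeltaP' (Youden's J)
markedness = tp / (tp + fp) + tn / (tn + fn) - 1      # DeltaP

mcc = (tp * tn - fp * fn) / sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

# For a better-than-chance classifier, MCC equals the geometric mean
# of informedness and markedness.
assert abs(mcc - sqrt(informedness * markedness)) < 1e-12
print(mcc, informedness, markedness)
</syntaxhighlight>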
Accuracy is a weighted arithmetic mean of Precision and Inverse Precision (weighted by Bias) as well as a weighted arithmetic mean of Recall and Inverse Recall (weighted by Prevalence).
Inverse Precision and Inverse Recall are simply the Precision and Recall of the inverse problem where positive and negative labels are exchanged (for both real classes and prediction labels).
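Written out explicitly (a restatement of the relationships above, assuming Bias denotes the fraction of items predicted positive and Prevalence the fraction of items that are actually positive):
:<math>\text{Accuracy} = \text{Bias}\cdot\text{Precision} + (1-\text{Bias})\cdot\text{Inverse Precision}</math>
:<math>\text{Accuracy} = \text{Prevalence}\cdot\text{Recall} + (1-\text{Prevalence})\cdot\text{Inverse Recall}</math>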
True Positive Rate and False Positive Rate, or equivalently Recall and 1 − Inverse Recall, are frequently plotted against each other as ROC curves and provide a principled mechanism to explore operating point tradeoffs. Outside of Information Retrieval, the application of Recall, Precision and F-measure is argued to be flawed, as they ignore the true negative cell of the contingency table and are easily manipulated by biasing the predictions.
The first problem is 'solved' by using Accuracy and the second problem is 'solved' by discounting the chance component and renormalizing to Cohen's kappa, but this no longer affords the opportunity to explore tradeoffs graphically. However, Informedness and Markedness are Kappa-like renormalizations of Recall and Precision, and their geometric mean, the Matthews correlation coefficient, thus acts like a debiased F-measure.
Definition
For classification tasks, the terms ''true positives'', ''true negatives'', ''false positives'', and ''false negatives'' compare the results of the classifier under test with trusted external judgments. The terms ''positive'' and ''negative'' refer to the classifier's prediction (sometimes known as the ''expectation''), and the terms ''true'' and ''false'' refer to whether that prediction corresponds to the external judgment (sometimes known as the ''observation'').
Let us define an experiment from ''P'' positive instances and ''N'' negative instances for some condition. The four outcomes can be formulated in a 2×2 contingency table or confusion matrix, as follows:
{| class="wikitable"
! !! Actually positive !! Actually negative
|-
! Predicted positive
| True positive (TP) || False positive (FP)
|-
! Predicted negative
| False negative (FN) || True negative (TN)
|}
Precision and recall are then defined as:
:<math>\text{Precision} = \frac{TP}{TP + FP}</math>
:<math>\text{Recall} = \frac{TP}{TP + FN}</math>
[Olson, David L.; and Delen, Dursun (2008); ''Advanced Data Mining Techniques'', Springer, 1st edition (February 1, 2008), page 138]
Recall in this context is also referred to as the true positive rate or sensitivity, and precision is also referred to as positive predictive value (PPV); other related measures used in classification include true negative rate and accuracy. True negative rate is also called specificity.
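A minimal Python sketch of these definitions (the function names are illustrative, not a reference implementation):
<syntaxhighlight lang="python">
def precision(tp: int, fp: int) -> float:
    """Positive predictive value: TP / (TP + FP)."""
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    """True positive rate (sensitivity): TP / (TP + FN)."""
    return tp / (tp + fn)

def specificity(tn: int, fp: int) -> float:
    """True negative rate: TN / (TN + FP)."""
    return tn / (tn + fp)

def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)
</syntaxhighlight>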
Precision vs. Recall
Both precision and recall may be useful in cases where there is imbalanced data. However, it may be valuable to prioritize one metric over the other in cases where the outcome of a false positive or false negative is costly. For example, in medical diagnosis, a false positive test can lead to unnecessary treatment and expenses. In this situation, it is useful to value precision over recall. In other cases, the cost of a false negative is high, and recall may be a more valuable metric. For instance, the cost of a false negative in fraud detection is high, as failing to detect a fraudulent transaction can result in significant financial loss.
Probabilistic Definition
Precision and recall can be interpreted as (estimated) conditional probabilities: Precision is given by <math>\mathbb{P}(C=P \mid \hat{C}=P)</math>, while recall is given by <math>\mathbb{P}(\hat{C}=P \mid C=P)</math>, where <math>\hat{C}</math> is the predicted class and <math>C</math> is the actual class (i.e. <math>C=P</math> means the actual class is positive). Both quantities are, therefore, connected by Bayes' theorem.
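Spelling out that connection (a direct application of Bayes' theorem to the definitions above):
:<math>\text{Precision} = \mathbb{P}(C=P \mid \hat{C}=P) = \frac{\mathbb{P}(\hat{C}=P \mid C=P)\,\mathbb{P}(C=P)}{\mathbb{P}(\hat{C}=P)} = \text{Recall}\cdot\frac{\mathbb{P}(C=P)}{\mathbb{P}(\hat{C}=P)}.</math>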
No-Skill Classifiers
The probabilistic interpretation makes it easy to derive how a no-skill classifier would perform. A no-skill classifier is defined by the property that the joint probability <math>\mathbb{P}(C=P, \hat{C}=P) = \mathbb{P}(C=P)\,\mathbb{P}(\hat{C}=P)</math> is just the product of the unconditional probabilities, since the classification and the presence of the class are independent.
For example, the precision of a no-skill classifier is simply a constant,
:<math>\text{Precision}_{\text{no-skill}} = \mathbb{P}(C=P \mid \hat{C}=P) = \frac{\mathbb{P}(C=P)\,\mathbb{P}(\hat{C}=P)}{\mathbb{P}(\hat{C}=P)} = \mathbb{P}(C=P),</math>
i.e. determined by the probability/frequency with which the class P occurs.
A similar argument can be made for the recall:
:<math>\text{Recall}_{\text{no-skill}} = \mathbb{P}(\hat{C}=P \mid C=P) = \frac{\mathbb{P}(C=P)\,\mathbb{P}(\hat{C}=P)}{\mathbb{P}(C=P)} = \mathbb{P}(\hat{C}=P),</math>
which is the probability of a positive classification.
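A small simulation can illustrate this behaviour (a sketch assuming a classifier that predicts the positive label at random, independently of the true label; the prevalence and positive-prediction rate below are arbitrary):
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
prevalence = 0.2        # P(C = P): fraction of truly positive items
positive_rate = 0.35    # P(C-hat = P): how often the classifier says "positive"

y_true = rng.random(n) < prevalence
y_pred = rng.random(n) < positive_rate   # independent of y_true: no skill

tp = np.sum(y_true & y_pred)
precision = tp / np.sum(y_pred)   # ≈ prevalence (0.2)
recall = tp / np.sum(y_true)      # ≈ positive_rate (0.35)
print(precision, recall)
</syntaxhighlight>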
Imbalanced data
Accuracy can be a misleading metric for imbalanced data sets. Consider a sample with 95 negative and 5 positive values. Classifying all values as negative in this case gives a 0.95 accuracy score. There are many metrics that don't suffer from this problem. For example, balanced accuracy (bACC) normalizes true positive and true negative predictions by the number of positive and negative samples, respectively, and divides their sum by two:
:<math>\text{bACC} = \frac{TPR + TNR}{2} = \frac{1}{2}\left(\frac{TP}{TP+FN} + \frac{TN}{TN+FP}\right)</math>
For the previous example (95 negative and 5 positive samples), classifying all as negative gives 0.5 balanced accuracy score (the maximum bACC score is one), which is equivalent to the expected value of a random guess in a balanced data set. Balanced accuracy can serve as an overall performance metric for a model, whether or not the true labels are imbalanced in the data, assuming the cost of FN is the same as FP.
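As a minimal sketch, the 95/5 example can be reproduced in a few lines of Python (variable names are illustrative):
<syntaxhighlight lang="python">
# 95 negative and 5 positive samples; the classifier labels everything negative.
tp, fn = 0, 5
tn, fp = 95, 0

accuracy = (tp + tn) / (tp + tn + fp + fn)   # 0.95, looks good but is misleading
tpr = tp / (tp + fn)                         # 0.0
tnr = tn / (tn + fp)                         # 1.0
balanced_accuracy = (tpr + tnr) / 2          # 0.5, no better than random guessing
print(accuracy, balanced_accuracy)
</syntaxhighlight>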
The TPR and FPR are a property of a given classifier operating at a specific threshold. However, the overall number of TPs, FPs, ''etc.'' depends on the class imbalance in the data via the class ratio <math>r = P/N</math>. As the recall (or TPR) depends only on positive cases, it is not affected by <math>r</math>, but the precision is. We have that
:<math>\text{Precision} = \frac{TP}{TP+FP} = \frac{TPR \cdot P}{TPR \cdot P + FPR \cdot N} = \frac{TPR}{TPR + \tfrac{1}{r}\,FPR}.</math>
Thus the precision has an explicit dependence on <math>r</math>. Starting with balanced classes at <math>r = 1</math> and gradually decreasing <math>r</math>, the corresponding precision will decrease, because the denominator increases.
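The effect can be made concrete with a short sketch (the TPR and FPR values are arbitrary, chosen only for illustration):
<syntaxhighlight lang="python">
# Fixed classifier operating point (illustrative values).
tpr, fpr = 0.8, 0.1

for r in (1.0, 0.5, 0.1, 0.01):   # class ratio r = P / N
    precision = tpr / (tpr + fpr / r)
    print(f"r = {r:5.2f}  precision = {precision:.3f}")
# Precision falls from ~0.89 at balanced classes to ~0.07 at r = 0.01.
</syntaxhighlight>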
Another metric is the predicted positive condition rate (PPCR), which identifies the percentage of the total population that is flagged. For example, for a search engine that returns 30 results (retrieved documents) out of 1,000,000 documents, the PPCR is 0.003%.
According to Saito and Rehmsmeier, precision-recall plots are more informative than ROC plots when evaluating binary classifiers on imbalanced data. In such scenarios, ROC plots may be visually deceptive with respect to conclusions about the reliability of classification performance.
In contrast to the above approaches, if an imbalance scaling is applied directly by weighting the confusion matrix elements, the standard metric definitions still apply even in the case of imbalanced datasets. The weighting procedure relates the confusion matrix elements to the support set of each considered class.
F-measure
A measure that combines precision and recall is the harmonic mean of precision and recall, the traditional F-measure or balanced F-score:
:<math>F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}</math>
This measure is approximately the average of the two when they are close, and is more generally the harmonic mean, which, for the case of two numbers, coincides with the square of the geometric mean divided by the arithmetic mean. There are several reasons that the F-score can be criticized, in particular circumstances, due to its bias as an evaluation metric.
This is also known as the <math>F_1</math> measure, because recall and precision are evenly weighted.
It is a special case of the general <math>F_\beta</math> measure (for non-negative real values of <math>\beta</math>):
:<math>F_\beta = (1+\beta^2) \cdot \frac{\text{precision} \cdot \text{recall}}{\beta^2 \cdot \text{precision} + \text{recall}}</math>
Two other commonly used <math>F</math> measures are the <math>F_2</math> measure, which weights recall higher than precision, and the <math>F_{0.5}</math> measure, which puts more emphasis on precision than recall.
The F-measure was derived by van Rijsbergen (1979) so that <math>F_\beta</math> "measures the effectiveness of retrieval with respect to a user who attaches <math>\beta</math> times as much importance to recall as precision". It is based on van Rijsbergen's effectiveness measure
:<math>E_\alpha = 1 - \frac{1}{\frac{\alpha}{\text{precision}} + \frac{1-\alpha}{\text{recall}}},</math>
the second term being the weighted harmonic mean of precision and recall with weights <math>(\alpha, 1-\alpha)</math>. Their relationship is <math>F_\beta = 1 - E_\alpha</math> where <math>\alpha = \frac{1}{1+\beta^2}</math>.
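A minimal Python sketch of these formulas (the function name is illustrative; the example precision and recall values are taken from the dog/cat example above):
<syntaxhighlight lang="python">
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """General F-beta score; beta > 1 weights recall more, beta < 1 weights precision more."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

p, r = 0.625, 5 / 12           # precision and recall from the dog/cat example
print(f_beta(p, r))            # F1 = 0.5
print(f_beta(p, r, beta=2))    # F2: recall weighted higher
print(f_beta(p, r, beta=0.5))  # F0.5: precision weighted higher
</syntaxhighlight>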
Limitations as goals
There are other parameters and strategies for measuring the performance of an information retrieval system, such as the area under the ROC curve (AUC)[Zygmunt Zając. What you wanted to know about AUC. http://fastml.com/what-you-wanted-to-know-about-auc/] or pseudo-R-squared.
Multi-class evaluation
Precision and recall values can also be calculated for classification problems with more than two classes.
To obtain the precision for a given class, we divide the number of true positives by the classifier bias towards this class (number of times that the classifier has predicted the class). To calculate the recall for a given class, we divide the number of true positives by the prevalence of this class (number of times that the class occurs in the data sample).
The class-wise precision and recall values can then be combined into an overall multi-class evaluation score, e.g., using the
macro F1 metric.
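As a sketch of the per-class computation described above (the labels below are invented for illustration; libraries such as scikit-learn provide the same calculation):
<syntaxhighlight lang="python">
# Illustrative true and predicted labels for a three-class problem.
y_true = ["cat", "dog", "dog", "bird", "cat", "dog", "bird", "cat"]
y_pred = ["cat", "dog", "cat", "bird", "cat", "dog", "dog", "dog"]

classes = sorted(set(y_true))
f1_per_class = []
for c in classes:
    tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
    predicted = sum(p == c for p in y_pred)   # classifier bias towards class c
    actual = sum(t == c for t in y_true)      # prevalence of class c
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    denom = precision + recall
    f1_per_class.append(2 * precision * recall / denom if denom else 0.0)

macro_f1 = sum(f1_per_class) / len(f1_per_class)  # unweighted mean over classes
print(macro_f1)
</syntaxhighlight>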
See also
* Uncertainty coefficient, also called ''proficiency''
* Sensitivity and specificity
* Confusion matrix
* Scoring rule
* Base rate fallacy
References
* Baeza-Yates, Ricardo; Ribeiro-Neto, Berthier (1999). ''Modern Information Retrieval''. New York, NY: ACM Press, Addison-Wesley, pp. 75 ff.
* Hjørland, Birger (2010); ''The foundation of the concept of relevance'', Journal of the American Society for Information Science and Technology, 61(2), 217–237
* Makhoul, John; Kubala, Francis; Schwartz, Richard; and Weischedel, Ralph (1999); ''Performance measures for information extraction'', in ''Proceedings of DARPA Broadcast News Workshop, Herndon, VA, February 1999''
* van Rijsbergen, Cornelis Joost "Keith" (1979); ''Information Retrieval'', London, GB; Boston, MA: Butterworth, 2nd Edition
External links
{{Machine learning evaluation metrics}}
Information retrieval evaluation
Information science
Bioinformatics