In statistical analysis of binary classification and information retrieval systems, the F-score or F-measure is a measure of predictive performance. It is calculated from the precision and recall of the test, where the precision is the number of true positive results divided by the number of all samples predicted to be positive, including those not identified correctly, and the recall is the number of true positive results divided by the number of all samples that should have been identified as positive. Precision is also known as positive predictive value, and recall is also known as sensitivity in diagnostic binary classification. The F₁ score is the harmonic mean of the precision and recall. It thus symmetrically represents both precision and recall in one metric. The more generic F_β score applies additional weights, valuing one of precision or recall more than the other. The highest possible value of an F-score is 1.0, indicating perfect precision and recall, and the lowest possible value is 0, which occurs when the precision or the recall is zero.

Etymology

The name F-measure is believed to come from a different F function in Van Rijsbergen's book, chosen when the measure was introduced to the Fourth Message Understanding Conference (MUC-4, 1992).

Definition

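The traditional F-measure or balanced F-score (F₁ score) is the harmonic mean of precision and recall:

F_1 = \frac{2}{\mathrm{recall}^{-1} + \mathrm{precision}^{-1}} = 2 \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} = \frac{2\mathrm{TP}}{2\mathrm{TP} + \mathrm{FP} + \mathrm{FN}}

With precision = TP / (TP + FP) and recall = TP / (TP + FN), it follows that the numerator of F₁ is the sum of their numerators and the denominator of F₁ is the sum of their denominators. To see it as a harmonic mean, note that

F_1^{-1} = \frac{1}{2} (\mathrm{recall}^{-1} + \mathrm{precision}^{-1})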
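As a rough illustration of the definition, the following Python sketch computes F₁ both from confusion counts and as the harmonic mean of precision and recall, and the two routes agree. The function names and the counts are hypothetical, and returning 0.0 when the score is undefined (no positives at all) is an assumed convention, not part of the definition.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = 2TP / (2TP + FP + FN).

    Assumption: return 0.0 when the denominator is zero (score undefined).
    """
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 if either is zero."""
    if precision == 0 or recall == 0:
        return 0.0
    return 2 / (1 / precision + 1 / recall)


# Hypothetical counts: 8 true positives, 2 false positives, 4 false negatives.
tp, fp, fn = 8, 2, 4
precision = tp / (tp + fp)  # 8 / 10
recall = tp / (tp + fn)     # 8 / 12

f1_counts = f1_from_counts(tp, fp, fn)  # 16 / 22
f1_harmonic = f1_from_precision_recall(precision, recall)

print(f"F1 from counts:        {f1_counts:.4f}")
print(f"F1 as harmonic mean:   {f1_harmonic:.4f}")
```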

F_β score

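A more general F score, F_β, that uses a positive real factor β, where β is chosen such that recall is considered β times as important as precision, is:

F_\beta = \frac{\beta^2 + 1}{(\beta^2 \cdot \mathrm{recall}^{-1}) + \mathrm{precision}^{-1}} = \frac{(1 + \beta^2) \cdot \mathrm{precision} \cdot \mathrm{recall}}{(\beta^2 \cdot \mathrm{precision}) + \mathrm{recall}}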

To see that it is a weighted harmonic mean, note that

F_\beta^{-1} = \frac{1}{\beta + \beta^{-1}} (\beta \cdot \mathrm{recall}^{-1} + \beta^{-1} \cdot \mathrm{precision}^{-1})

In terms of Type I and type II errors this becomes:

F_\beta = \frac{(1 + \beta^2) \cdot \mathrm{TP}}{(1 + \beta^2) \cdot \mathrm{TP} + \beta^2 \cdot \mathrm{FN} + \mathrm{FP}}
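The two forms of F_β (from precision and recall, and from error counts) can be checked against each other numerically. The sketch below is only an illustration: the function names and counts are made up, and returning 0.0 for a zero denominator is an assumed convention.

```python
def fbeta_from_counts(tp: int, fp: int, fn: int, beta: float = 1.0) -> float:
    """F_beta = (1 + b^2) TP / ((1 + b^2) TP + b^2 FN + FP)."""
    b2 = beta ** 2
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0  # assumed convention


def fbeta_from_pr(precision: float, recall: float, beta: float = 1.0) -> float:
    """F_beta = (1 + b^2) P R / (b^2 P + R)."""
    b2 = beta ** 2
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom else 0.0


# Hypothetical counts where precision (0.8) exceeds recall (about 0.667).
tp, fp, fn = 8, 2, 4
precision = tp / (tp + fp)
recall = tp / (tp + fn)

results = {}
for beta in (0.5, 1.0, 2.0):
    results[beta] = (
        fbeta_from_counts(tp, fp, fn, beta),
        fbeta_from_pr(precision, recall, beta),
    )
    print(beta, results[beta])
```

Because precision is higher than recall in this example, the score falls as β grows and recall is weighted more heavily.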

Two commonly used values for β are 2, which weighs recall higher than precision, and 1/2, which weighs recall lower than precision. The F-measure was derived so that F_β "measures the effectiveness of retrieval with respect to a user who attaches β times as much importance to recall as precision". It is based on Van Rijsbergen's effectiveness measure

E = 1 - \left(\frac{\alpha}{\mathrm{precision}} + \frac{1 - \alpha}{\mathrm{recall}}\right)^{-1}

Their relationship is:

F_\beta = 1 - E

where

\alpha = \frac{1}{1 + \beta^2}
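The relationship between F_β and Van Rijsbergen's E can be confirmed numerically. This is a minimal sketch with hypothetical precision and recall values and made-up function names; it assumes both precision and recall are strictly positive.

```python
def effectiveness(precision: float, recall: float, alpha: float) -> float:
    """Van Rijsbergen's E = 1 - (alpha / precision + (1 - alpha) / recall)^-1."""
    return 1 - 1 / (alpha / precision + (1 - alpha) / recall)


def fbeta(precision: float, recall: float, beta: float) -> float:
    """F_beta = (1 + b^2) P R / (b^2 P + R)."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


precision, recall = 0.8, 0.5  # hypothetical values
gaps = []
for beta in (0.5, 1.0, 2.0):
    alpha = 1 / (1 + beta ** 2)
    f = fbeta(precision, recall, beta)
    e = effectiveness(precision, recall, alpha)
    gaps.append(abs(f - (1 - e)))
    print(f"beta={beta}: F_beta={f:.4f}, 1 - E={1 - e:.4f}")
```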

Diagnostic testing

This is related to the field of binary classification, where recall is often termed "sensitivity".

[Figure: 3D plot of the harmonic mean, from 0 to 100]

Dependence of the F-score on class imbalance

The precision-recall curve, and thus the F_β score, explicitly depends on the ratio r of positive to negative test cases. This means that comparison of the F-score across different problems with differing class ratios is problematic. One way to address this issue (see e.g., Siblini et al., 2020) is to use a standard class ratio r₀ when making such comparisons.
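To make the dependence on r concrete, the sketch below holds a classifier's true positive rate and false positive rate fixed and varies only the ratio of positives to negatives. With TP = TPR · P and FP = FPR · N, precision becomes r · TPR / (r · TPR + FPR), so F₁ changes even though the classifier does not. The rates and the function name are hypothetical; this illustrates the dependence only and is not the procedure of the cited paper.

```python
def f1_at_class_ratio(tpr: float, fpr: float, r: float) -> float:
    """F1 for fixed TPR and FPR when positives : negatives = r."""
    precision = r * tpr / (r * tpr + fpr)
    recall = tpr
    return 2 * precision * recall / (precision + recall)


tpr, fpr = 0.9, 0.1  # hypothetical, identical classifier behaviour throughout
scores = {r: f1_at_class_ratio(tpr, fpr, r) for r in (1.0, 0.1, 0.01)}

for r, score in scores.items():
    print(f"r = {r:<5} F1 = {score:.4f}")
```

Evaluating every system at the same assumed ratio r₀ (for example, calling `f1_at_class_ratio(tpr, fpr, r0)` for each) removes this source of variation from the comparison.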

Applications

The F-score is often used in the field of information retrieval for measuring search, document classification, and query classification performance. It is particularly relevant in applications which are primarily concerned with the positive class and where the positive class is rare relative to the negative class.

Earlier works focused primarily on the F₁ score, but with the proliferation of large scale search engines, performance goals changed to place more emphasis on either precision or recall, and so F_β is seen in wide application.

The F-score is also used in machine learning. However, the F-measures do not take true negatives into account, hence measures such as the Matthews correlation coefficient, Informedness or Cohen's kappa may be preferred to assess the performance of a binary classifier.

The F-score has been widely used in the natural language processing literature, such as in the evaluation of named entity recognition and word segmentation.

Properties

The F₁ score is the Dice coefficient of the set of retrieved items and the set of relevant items.
* The F₁-score of a classifier which always predicts the positive class converges to 1 as the probability of the positive class increases.
* The F₁-score of a classifier which always predicts the positive class is equal to 2 * proportion_of_positive_class / (1 + proportion_of_positive_class), since the recall is 1, and the precision is equal to the proportion of the positive class.
* If the scoring model is uninformative (cannot distinguish between the positive and negative class) then the optimal threshold is 0 so that the positive class is always predicted.
* The F₁ score is concave in the true positive rate.
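Two of these properties are easy to check by hand. The sketch below, using made-up sets and counts, compares the Dice coefficient of a retrieved set and a relevant set with F₁ computed from the corresponding counts, and compares the always-positive formula with a direct count-based calculation.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = 2TP / (2TP + FP + FN)."""
    return 2 * tp / (2 * tp + fp + fn)


# Dice coefficient of retrieved and relevant item sets (hypothetical IDs).
retrieved = {1, 2, 3, 4, 5}
relevant = {4, 5, 6, 7}
dice = 2 * len(retrieved & relevant) / (len(retrieved) + len(relevant))

tp = len(retrieved & relevant)  # retrieved and relevant
fp = len(retrieved - relevant)  # retrieved but not relevant
fn = len(relevant - retrieved)  # relevant but not retrieved
f1_sets = f1_from_counts(tp, fp, fn)

# Always-positive classifier on n samples, k of which are positive.
n, k = 1000, 250
p = k / n                                        # proportion of positive class
f1_formula = 2 * p / (1 + p)                     # recall = 1, precision = p
f1_direct = f1_from_counts(tp=k, fp=n - k, fn=0)

print(dice, f1_sets)          # both 4/9
print(f1_formula, f1_direct)  # both 0.4
```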

Criticism

David Hand and others criticize the widespread use of the F₁ score since it gives equal importance to precision and recall. In practice, different types of misclassifications incur different costs. In other words, the relative importance of precision and recall is an aspect of the problem.

According to Davide Chicco and Giuseppe Jurman, the F₁ score is less truthful and informative than the Matthews correlation coefficient (MCC) in the evaluation of binary classification.

David M W Powers has pointed out that F₁ ignores the true negatives and thus is misleading for unbalanced classes, while kappa and correlation measures are symmetric and assess both directions of predictability: the classifier predicting the true class and the true class predicting the classifier prediction. He proposes separate multiclass measures Informedness and Markedness for the two directions, noting that their geometric mean is correlation.

Another source of critique of F₁ is its lack of symmetry: its value may change when the dataset labeling is swapped, so that the "positive" samples are named "negative" and vice versa. This criticism is met by the P4 metric definition, which is sometimes indicated as a symmetrical extension of F₁.

Finally, Ferrer and Dyrland et al. argue that the expected cost (or its counterpart, the expected utility) is the only principled metric for evaluation of classification decisions, having various advantages over the F-score and the MCC. Both works show that the F-score can result in wrong conclusions about the absolute and relative quality of systems.
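The lack of symmetry can be seen on a small example. Swapping the label names exchanges true positives with true negatives and false positives with false negatives. In the sketch below, which uses hypothetical counts for an unbalanced problem, F₁ changes sharply under the swap while the Matthews correlation coefficient stays the same. Returning 0.0 for a zero MCC denominator is an assumed convention.

```python
import math


def f1(tp: int, fp: int, fn: int, tn: int) -> float:
    """F1 ignores tn entirely; it is accepted only for a uniform signature."""
    return 2 * tp / (2 * tp + fp + fn)


def mcc(tp: int, fp: int, fn: int, tn: int) -> float:
    """Matthews correlation coefficient from confusion counts."""
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / den if den else 0.0  # assumed convention


# Hypothetical unbalanced problem: 94 positives, 6 negatives.
original = dict(tp=90, fp=5, fn=4, tn=1)
# Renaming "positive" <-> "negative" swaps tp with tn and fp with fn.
swapped = dict(tp=original["tn"], fp=original["fn"],
               fn=original["fp"], tn=original["tp"])

f1_original, f1_swapped = f1(**original), f1(**swapped)
mcc_original, mcc_swapped = mcc(**original), mcc(**swapped)

print(f"F1:  {f1_original:.3f} -> {f1_swapped:.3f}")
print(f"MCC: {mcc_original:.3f} -> {mcc_swapped:.3f}")
```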

Difference from Fowlkes–Mallows index

While the F-measure is the harmonic mean of recall and precision, the Fowlkes–Mallows index is their geometric mean.
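Since the harmonic mean of two positive numbers never exceeds their geometric mean, the Fowlkes–Mallows index is at least as large as F₁ for the same precision and recall, with equality when the two are equal. A minimal sketch with hypothetical values and made-up function names:

```python
import math


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def fowlkes_mallows(precision: float, recall: float) -> float:
    """Geometric mean of precision and recall."""
    return math.sqrt(precision * recall)


# Hypothetical unequal precision and recall: the two means differ.
f1_unequal = f1_score(0.9, 0.4)
fm_unequal = fowlkes_mallows(0.9, 0.4)

# Equal precision and recall: the two means coincide.
f1_equal = f1_score(0.7, 0.7)
fm_equal = fowlkes_mallows(0.7, 0.7)

print(f"P=0.9, R=0.4: F1={f1_unequal:.4f}, FM={fm_unequal:.4f}")
print(f"P=0.7, R=0.7: F1={f1_equal:.4f}, FM={fm_equal:.4f}")
```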

Extension to multi-class classification

The F-score is also used for evaluating classification problems with more than two classes (multiclass classification). A common method is to average the F-score over each class, aiming at a balanced measurement of performance.

Macro F1

''Macro F1'' is a macro-averaged F1 score aiming at a balanced performance measurement. Two different averaging formulas have been used to calculate macro F1: the F1 score of the (arithmetic) means of class-wise precision and recall, or the arithmetic mean of class-wise F1 scores. The latter exhibits more desirable properties.
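The two macro-averaging formulas generally give different numbers. The sketch below computes both from hypothetical per-class counts (true positives, false positives, false negatives) for three classes; the class names and counts are made up for illustration.

```python
# Hypothetical per-class counts: class -> (tp, fp, fn).
counts = {"A": (8, 2, 1), "B": (1, 1, 4), "C": (5, 3, 1)}


def harmonic_mean(x: float, y: float) -> float:
    return 2 * x * y / (x + y)


precisions = [tp / (tp + fp) for tp, fp, fn in counts.values()]
recalls = [tp / (tp + fn) for tp, fp, fn in counts.values()]

# Formula 1: F1 of the arithmetic means of class-wise precision and recall.
mean_precision = sum(precisions) / len(precisions)
mean_recall = sum(recalls) / len(recalls)
f1_of_means = harmonic_mean(mean_precision, mean_recall)

# Formula 2: arithmetic mean of the class-wise F1 scores.
class_f1 = [harmonic_mean(p, r) for p, r in zip(precisions, recalls)]
mean_of_f1 = sum(class_f1) / len(class_f1)

print(f"F1 of averaged precision/recall: {f1_of_means:.4f}")
print(f"Average of class-wise F1:        {mean_of_f1:.4f}")
```

In this example the poorly handled class B lowers the mean of class-wise F1 scores more than it lowers the F1 of the averaged precision and recall.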

Micro F1

''Micro F1'' is the harmonic mean of ''micro precision'' and ''micro recall''. In single-label multi-class classification, micro precision equals micro recall, thus micro F1 is equal to both. However, contrary to a common misconception, micro F1 does not generally equal ''accuracy'' when accuracy is computed over the pooled per-class (one-vs-rest) decisions, because that accuracy takes true negatives into account while micro F1 does not.
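The following sketch pools one-vs-rest confusion counts over all classes for a tiny hypothetical single-label dataset. It shows that micro precision and micro recall coincide, because every misclassified sample is one false positive for the predicted class and one false negative for the true class. Which notion of accuracy is meant matters here: the accuracy computed from the pooled counts, which includes true negatives, differs from micro F1, whereas in this single-label example the plain fraction of correctly labeled samples does not. The labels and variable names are made up for illustration.

```python
y_true = ["a", "a", "a", "b", "b", "c"]
y_pred = ["a", "a", "b", "b", "c", "c"]
classes = sorted(set(y_true) | set(y_pred))

# Pool one-vs-rest confusion counts over all classes.
tp = fp = fn = tn = 0
for c in classes:
    for t, p in zip(y_true, y_pred):
        if p == c and t == c:
            tp += 1
        elif p == c and t != c:
            fp += 1
        elif p != c and t == c:
            fn += 1
        else:
            tn += 1

micro_precision = tp / (tp + fp)
micro_recall = tp / (tp + fn)
micro_f1 = 2 * micro_precision * micro_recall / (micro_precision + micro_recall)

# Accuracy over the pooled one-vs-rest decisions (counts true negatives).
pooled_ovr_accuracy = (tp + tn) / (tp + tn + fp + fn)
# Fraction of samples whose single label is predicted correctly.
fraction_correct = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(f"micro precision = {micro_precision:.4f}, micro recall = {micro_recall:.4f}")
print(f"micro F1 = {micro_f1:.4f}")
print(f"pooled one-vs-rest accuracy = {pooled_ovr_accuracy:.4f}")
print(f"fraction of correct labels  = {fraction_correct:.4f}")
```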
