Cohen's kappa coefficient (''κ'', lowercase Greek

kappa Kappa (uppercase Κ, lowercase κ or cursive ; el, κάππα, ''káppa'') is the 10th letter of the Greek alphabet, representing the voiceless velar plosive sound in Ancient and Modern Greek. In the system of Greek numerals, has a value ...

) is a

statistic A statistic (singular) or sample statistic is any quantity computed from values in a sample which is considered for a statistical purpose. Statistical purposes include estimating a population parameter, describing a sample, or evaluating a hypo ...

that is used to measure inter-rater reliability (and also

intra-rater reliability In statistics, intra-rater reliability is the degree of agreement among repeated administrations of a diagnostic test performed by a single rater. Intra-rater reliability and inter-rater reliability are aspects of test validity. See also * In ...

) for qualitative (categorical) items. It is generally thought to be a more robust measure than simple percent agreement calculation, as ''κ'' takes into account the possibility of the agreement occurring by chance. There is controversy surrounding Cohen's kappa due to the difficulty in interpreting indices of agreement. Some researchers have suggested that it is conceptually simpler to evaluate disagreement between items.

History

The first mention of a kappa-like statistic is attributed to Galton in 1892. The seminal paper introducing kappa as a new technique was published by Jacob Cohen in the journal ''Educational and Psychological Measurement'' in 1960.

Definition

Cohen's kappa measures the agreement between two raters who each classify ''N'' items into ''C'' mutually exclusive categories. The definition of

\kappa

is :

\kappa \equiv \frac = 1- \frac,

where is the relative observed agreement among raters, and is the hypothetical probability of chance agreement, using the observed data to calculate the probabilities of each observer randomly seeing each category. If the raters are in complete agreement then

\kappa=1

. If there is no agreement among the raters other than what would be expected by chance (as given by ),

\kappa=0

. It is possible for the statistic to be negative, which can occur by chance if there is no relationship between the ratings of the two raters, or it may reflect a real tendency of the raters to give differing ratings. For categories, observations to categorize and

n_

the number of times rater predicted category : :

p_e = \frac \sum_k n_n_

This is derived from the following construction: :

p_e = \sum_k \widehat = \sum_k \widehat\widehat = \sum_k \frac\frac = \frac \sum_k n_n_

Where

\widehat

is the estimated probability that both rater 1 and rater 2 will classify the same item as k, while

\widehat

is the estimated probability that rater 1 will classify an item as k (and similarly for rater 2). The relation

\widehat = \sum_k \widehat\widehat

is based on using the assumption that the rating of the two raters are

independent Independent or Independents may refer to: Arts, entertainment, and media Artist groups * Independents (artist group), a group of modernist painters based in the New Hope, Pennsylvania, area of the United States during the early 1930s * Independe ...

. The term

\widehat

is estimated by using the number of items classified as k by rater 1 (

n_

) divided by the total items to classify (

N

\widehat=

(and similarly for rater 2).

Binary classification confusion matrix

In the traditional 2 × 2 confusion matrix employed in

machine learning Machine learning (ML) is a field of inquiry devoted to understanding and building methods that 'learn', that is, methods that leverage data to improve performance on some set of tasks. It is seen as a part of artificial intelligence. Machine ...

and

statistics Statistics (from German: '' Statistik'', "description of a state, a country") is the discipline that concerns the collection, organization, analysis, interpretation, and presentation of data. In applying statistics to a scientific, indust ...

to evaluate

binary classification Binary classification is the task of classifying the elements of a set into two groups (each called ''class'') on the basis of a classification rule. Typical binary classification problems include: * Medical testing to determine if a patient has c ...

s, the Cohen's Kappa formula can be written as: :

\kappa = \frac

where TP are the true positives, FP are the false positives, TN are the true negatives, and FN are the false negatives. In this case, Cohen's Kappa is equivalent to the ''Heidke skill score'' known in

Meteorology Meteorology is a branch of the atmospheric sciences (which include atmospheric chemistry and physics) with a major focus on weather forecasting. The study of meteorology dates back millennia, though significant progress in meteorology did no ...

. The measure was first introduced by Myrick Haskell Doolittle in 1888.

Examples

Simple example

Suppose that you were analyzing data related to a group of 50 people applying for a grant. Each grant proposal was read by two readers and each reader either said "Yes" or "No" to the proposal. Suppose the disagreement count data were as follows, where A and B are readers, data on the main diagonal of the matrix (a and d) count the number of agreements and off-diagonal data (b and c) count the number of disagreements: e.g. The observed proportionate agreement is: :

p_o = \frac = \frac = 0.7

To calculate (the probability of random agreement) we note that: * Reader A said "Yes" to 25 applicants and "No" to 25 applicants. Thus reader A said "Yes" 50% of the time. * Reader B said "Yes" to 30 applicants and "No" to 20 applicants. Thus reader B said "Yes" 60% of the time. So the expected probability that both would say yes at random is: :

p_\text = \frac \cdot \frac = 0.5 \times 0.6 = 0.3

Similarly: :

p_\text = \frac \cdot \frac = 0.5 \times 0.4 = 0.2

Overall random agreement probability is the probability that they agreed on either Yes or No, i.e.: :

p_e = p_\text + p_\text = 0.3 + 0.2 = 0.5

So now applying our formula for Cohen's Kappa we get: :

\kappa = \frac = \frac = 0.4

Same percentages but different numbers

A case sometimes considered to be a problem with Cohen's Kappa occurs when comparing the Kappa calculated for two pairs of raters with the two raters in each pair having the same percentage agreement but one pair give a similar number of ratings in each class while the other pair give a very different number of ratings in each class. (In the cases below, notice B has 70 yeses and 30 nos, in the first case, but those numbers are reversed in the second.) For instance, in the following two cases there is equal agreement between A and B (60 out of 100 in both cases) in terms of agreement in each class, so we would expect the relative values of Cohen's Kappa to reflect this. However, calculating Cohen's Kappa for each: :

\kappa = \frac = 0.1304

\kappa = \frac = 0.2593

we find that it shows greater similarity between A and B in the second case, compared to the first. This is because while the percentage agreement is the same, the percentage agreement that would occur 'by chance' is significantly higher in the first case (0.54 compared to 0.46).

Properties

Hypothesis testing and confidence interval

P-value In null-hypothesis significance testing, the ''p''-value is the probability of obtaining test results at least as extreme as the result actually observed, under the assumption that the null hypothesis is correct. A very small ''p''-value means ...

for kappa is rarely reported, probably because even relatively low values of kappa can nonetheless be significantly different from zero but not of sufficient magnitude to satisfy investigators. Still, its standard error has been described and is computed by various computer programs. Confidence intervals for Kappa may be constructed, for the expected Kappa values if we had infinite number of items checked, using the following formula: :

CI: \kappa \pm Z_ SE_

Where

Z_ = 1.960

is the standard normal percentile when

\alpha = 5\%

, and

SE_ = \sqrt

This is calculated by ignoring that is estimated from the data, and by treating as an estimated probability of a

binomial distribution In probability theory and statistics, the binomial distribution with parameters ''n'' and ''p'' is the discrete probability distribution of the number of successes in a sequence of ''n'' independent experiments, each asking a yes–no ques ...

while using asymptotic normality (i.e.: assuming that the number of items is large and that is not close to either 0 or 1).

SE_

(and the CI in general) may also be estimated using bootstrap methods.

Interpreting magnitude

If statistical significance is not a useful guide, what magnitude of kappa reflects adequate agreement? Guidelines would be helpful, but factors other than agreement can influence its magnitude, which makes interpretation of a given magnitude problematic. As Sim and Wright noted, two important factors are prevalence (are the codes equiprobable or do their probabilities vary) and bias (are the marginal probabilities for the two observers similar or different). Other things being equal, kappas are higher when codes are equiprobable. On the other hand, Kappas are higher when codes are distributed asymmetrically by the two observers. In contrast to probability variations, the effect of bias is greater when Kappa is small than when it is large. Another factor is the number of codes. As number of codes increases, kappas become higher. Based on a simulation study, Bakeman and colleagues concluded that for fallible observers, values for kappa were lower when codes were fewer. And, in agreement with Sim & Wrights's statement concerning prevalence, kappas were higher when codes were roughly equiprobable. Thus Bakeman et al. concluded that "no one value of kappa can be regarded as universally acceptable." They also provide a computer program that lets users compute values for kappa specifying number of codes, their probability, and observer accuracy. For example, given equiprobable codes and observers who are 85% accurate, value of kappa are 0.49, 0.60, 0.66, and 0.69 when number of codes is 2, 3, 5, and 10, respectively. Nonetheless, magnitude guidelines have appeared in the literature. Perhaps the first was Landis and Koch, who characterized values < 0 as indicating no agreement and 0–0.20 as slight, 0.21–0.40 as fair, 0.41–0.60 as moderate, 0.61–0.80 as substantial, and 0.81–1 as almost perfect agreement. This set of guidelines is however by no means universally accepted; Landis and Koch supplied no evidence to support it, basing it instead on personal opinion. It has been noted that these guidelines may be more harmful than helpful. Fleiss's equally arbitrary guidelines characterize kappas over 0.75 as excellent, 0.40 to 0.75 as fair to good, and below 0.40 as poor.

Kappa maximum

Kappa assumes its theoretical maximum value of 1 only when both observers distribute codes the same, that is, when corresponding row and column sums are identical. Anything less is less than perfect agreement. Still, the maximum value kappa could achieve given unequal distributions helps interpret the value of kappa actually obtained. The equation for ''κ'' maximum is: :

\kappa_ =\frac

where

P_ = \sum_^k P_P_

, as usual,

P_ = \sum_^k \min(P_,P_)

, ''k'' = number of codes,

P_

are the row probabilities, and

P_

are the column probabilities.

Limitations

Kappa is an index that considers observed agreement with respect to a baseline agreement. However, investigators must consider carefully whether Kappa's baseline agreement is relevant for the particular research question. Kappa's baseline is frequently described as the agreement due to chance, which is only partially correct. Kappa's baseline agreement is the agreement that would be expected due to random allocation, given the quantities specified by the marginal totals of square contingency table. Thus, κ = 0 when the observed allocation is apparently random, regardless of the quantity disagreement as constrained by the marginal totals. However, for many applications, investigators should be more interested in the quantity disagreement in the marginal totals than in the allocation disagreement as described by the additional information on the diagonal of the square contingency table. Thus for many applications, Kappa's baseline is more distracting than enlightening. Consider the following example: The disagreement proportion is 14/16 or 0.875. The disagreement is due to quantity because allocation is optimal. κ is 0.01. The disagreement proportion is 2/16 or 0.125. The disagreement is due to allocation because quantities are identical. Kappa is −0.07. Here, reporting quantity and allocation disagreement is informative while Kappa obscures information. Furthermore, Kappa introduces some challenges in calculation and interpretation because Kappa is a ratio. It is possible for Kappa's ratio to return an undefined value due to zero in the denominator. Furthermore, a ratio does not reveal its numerator nor its denominator. It is more informative for researchers to report disagreement in two components, quantity and allocation. These two components describe the relationship between the categories more clearly than a single summary statistic. When predictive accuracy is the goal, researchers can more easily begin to think about ways to improve a prediction by using two components of quantity and allocation, rather than one ratio of Kappa.Some researchers have expressed concern over κ's tendency to take the observed categories' frequencies as givens, which can make it unreliable for measuring agreement in situations such as the diagnosis of rare diseases. In these situations, κ tends to underestimate the agreement on the rare category. For this reason, κ is considered an overly conservative measure of agreement. Others contest the assertion that kappa "takes into account" chance agreement. To do this effectively would require an explicit model of how chance affects rater decisions. The so-called chance adjustment of kappa statistics supposes that, when not completely certain, raters simply guess—a very unrealistic scenario. Moreover, some works have shown how kappa statistics can lead to a wrong conclusion for unbalanced data.

Related statistics

Scott's Pi

A similar statistic, called pi, was proposed by Scott (1955). Cohen's kappa and Scott's pi differ in terms of how is calculated.

Fleiss' kappa

Note that Cohen's kappa measures agreement between two raters only. For a similar measure of agreement ( Fleiss' kappa) used when there are more than two raters, see Fleiss (1971). The Fleiss kappa, however, is a multi-rater generalization of Scott's pi statistic, not Cohen's kappa. Kappa is also used to compare performance in

, but the directional version known as Informedness or Youden's J statistic is argued to be more appropriate for supervised learning.

Weighted kappa

The weighted kappa allows disagreements to be weighted differently and is especially useful when codes are ordered. Three matrices are involved, the matrix of observed scores, the matrix of expected scores based on chance agreement, and the weight matrix. Weight matrix cells located on the diagonal (upper-left to bottom-right) represent agreement and thus contain zeros. Off-diagonal cells contain weights indicating the seriousness of that disagreement. Often, cells one off the diagonal are weighted 1, those two off 2, etc. The equation for weighted κ is: :

\kappa = 1 - \frac

where ''k''=number of codes and

w_

x_