Background: This is another post in the series on how to compare confusion matrices. As in past posts, the approach is to use an aggregate objective function (or single-value metric) that takes a confusion matrix and reduces it to one value.

In a previous post, we discussed how the Matthews Correlation Coefficient and the F1 measure compare with each other and with reward/cost-based single-value metrics. Another single-value metric (or aggregate objective function) worth discussing is the Kappa statistic.

The Kappa statistic compares the accuracy of the system to the accuracy of a random system. To quote Richard Landis and Gary Koch from their 1977 paper The Measurement of Observer Agreement for Categorical Data, "..(total accuracy) is an observational probability of agreement and (random accuracy) is a hypothetical expected probability of agreement under an appropriate set of baseline constraints."

Total accuracy is simply the sum of true positives and true negatives, divided by the total number of items, that is:

totalAccuracy = (TP + TN) / (TP + TN + FP + FN)

Random accuracy is defined as the sum, over the classes, of the product of the reference likelihood and the result likelihood for each class. That is, for the binary case:

randomAccuracy = P(actual positive) × P(predicted positive) + P(actual negative) × P(predicted negative)

In terms of true/false positives and negatives, random accuracy can be written as:

randomAccuracy = ((TP + FN) × (TP + FP) + (TN + FP) × (TN + FN)) / Total²

The Kappa statistic is then the improvement of total accuracy over random accuracy, normalized by the maximum possible improvement:

kappa = (totalAccuracy − randomAccuracy) / (1 − randomAccuracy)
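The formulas above translate directly into code. Here is a minimal sketch of the computation for a binary confusion matrix; the function name and argument names are illustrative, not from the original post:

```python
def kappa(tp, fn, fp, tn):
    """Cohen's Kappa for a 2x2 confusion matrix."""
    total = tp + fn + fp + tn
    # Observed agreement: correct predictions over all items.
    total_accuracy = (tp + tn) / total
    # Expected agreement of a random system: for each class,
    # (reference likelihood) * (result likelihood), summed.
    random_accuracy = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / total**2
    return (total_accuracy - random_accuracy) / (1 - random_accuracy)
```

For example, a system with TP = 90, FN = 10, FP = 10, TN = 90 has total accuracy 0.9 and random accuracy 0.5, giving a Kappa of 0.8.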

I have taken the confusion matrices from the previous test cases and added the Kappa statistic to them as well. Here is a snapshot.

Two things about the Kappa statistic are of further interest:

Firstly, it is a general statistic that can be used for classification systems, not just for targeting systems. Secondly, the Kappa statistic is a normalized statistic, just like MCC: its value never exceeds one, so the same statistic can be used even as the number of observations grows.
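The normalization claim is easy to check directly: scaling every cell of the confusion matrix by a constant leaves Kappa unchanged, since both total and random accuracy are ratios. A small sketch (the helper repeats the formulas from earlier in the post; the names are illustrative):

```python
def kappa(tp, fn, fp, tn):
    """Cohen's Kappa for a 2x2 confusion matrix."""
    total = tp + fn + fp + tn
    total_accuracy = (tp + tn) / total
    random_accuracy = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / total**2
    return (total_accuracy - random_accuracy) / (1 - random_accuracy)

# Same class proportions, 100x the observations: Kappa is unchanged,
# so scores remain comparable as the dataset grows.
small = kappa(90, 10, 10, 90)
large = kappa(9000, 1000, 1000, 9000)
```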

Here is the link to the PDF if that is of interest.