# Why the “Accuracy” of a Classification System May Be a Useless Metric

Recently, I received a slide deck extolling the virtue of an exciting new classification system with a purported accuracy of 62.5%. While the number itself is not very high to begin with, the value of that 62.5% begins to diminish further once we evaluate what accuracy really represents. Accuracy is defined as (number of items correctly classified) / (total number).

Suppose the classes are not equally represented, but instead occur in a ratio of 2 to 1. That is, class 1 is the correct classification for two-thirds of the items, and class 2 is the correct classification for the remaining third. Consider a degenerate classification system that simply assigns class 1 to all items. The accuracy of that degenerate system is then 67%, which is higher than the 62.5% in the slide deck. And that system does not even do anything!
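To make the baseline concrete, here is a minimal sketch of the degenerate "always predict the most common class" system. The function name and the label list are hypothetical and exist only for illustration.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy of a system that always predicts the most common label."""
    counts = Counter(labels)
    # most_common(1) returns a list with one (label, count) pair.
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)


# Hypothetical data set in which class 1 and class 2 occur in a 2:1 ratio.
labels = [1, 1, 2] * 100
baseline = majority_baseline_accuracy(labels)
print(f"baseline accuracy: {baseline:.3f}")
```

Any reported accuracy is worth comparing against this number before drawing conclusions.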

This simple observation is the reason that there are so many other objective functions (for example, the kappa statistic, the Matthews correlation coefficient, and the F1 measure) that are considered much more appropriate than “accuracy”. The kappa statistic, for example, compares the accuracy of the system to the accuracy of a random system.

# Comparing Kappa Statistic to the F1 measure

We looked at Kappa Statistic previously, and I have been evaluating some aspects of it again.

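To remind ourselves, the kappa statistic is a measure of consistency amongst different raters, taking into account the agreement occurring by chance. The standard formula for the kappa statistic is given as:

$$\kappa = \frac{\text{total accuracy} - \text{random accuracy}}{1 - \text{random accuracy}}$$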

Firstly, an observation that I omitted to make: the value of the kappa statistic can indeed be negative. The total accuracy can be lower than the random accuracy, and as a CMAJ letter by Juurlink and Detsky points out, this may indicate genuine disagreement, or it may reflect a problem in the application of a diagnostic test.

Secondly, one thing to love about kappa is the following. Consider the case where one actual class is much more prevalent than the other. In such a case, a classification system that simply outputs the more prevalent class may have a high F1 measure (high precision and high recall), but will have a very low value of kappa. For example, consider the scenario in which we are asked whether it will rain in Seattle, and consider the following confusion matrix:

| | Predicted F | Predicted T |
|---|---|---|
| Actual F | 1 | 99 |
| Actual T | 1 | 899 |

This is a null hypothesis model, in the sense that it almost always predicts the class to be “T”. In the case of this confusion matrix, precision is 0.9008 and recall is 0.9989. F1 measure is 0.9473 and can give an impression that this is a useful model. Kappa value is very low, at 0.0157, and gives a clear enough warning about the validity of this model.
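The numbers quoted for this confusion matrix can be reproduced with a short sketch. It assumes the usual definition of kappa as (total accuracy − random accuracy) / (1 − random accuracy). The function name and the argument order are illustrative choices, not part of any library.

```python
def binary_metrics(tn, fp, fn, tp):
    """Precision, recall, F1 and kappa for a 2x2 confusion matrix."""
    total = tn + fp + fn + tp
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    total_accuracy = (tp + tn) / total
    # Chance agreement: for each class, (actual share) * (predicted share).
    random_accuracy = (
        (tn + fp) * (tn + fn) + (fn + tp) * (fp + tp)
    ) / total**2
    kappa = (total_accuracy - random_accuracy) / (1 - random_accuracy)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "kappa": kappa,
    }


# Seattle rain example: actual F row is (1, 99), actual T row is (1, 899).
metrics = binary_metrics(tn=1, fp=99, fn=1, tp=899)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The gap between the F1 measure and kappa comes entirely from the random-accuracy term, which is already close to the total accuracy of 0.9 for this matrix.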

In other words, while it may be easier to predict rain in Seattle (or sunshine in Aruba), kappa statistic tries to take away the bias in the actual distribution, while the F1 measure may not.

# Confusion Matrix – Another Single Value Metric – Kappa Statistic

Background: This is another in the line of posts on how to compare confusion matrices. The approach, as in the previous posts, is to use an aggregate objective function (or single value metric) that takes a confusion matrix and reduces it to one value.

In a previous post, we discussed how Matthews Correlation Coefficient and F1 measure compare with each other, and reward/cost based single value metrics. Another single value metric (or aggregate objective function) that is worth discussing is the Kappa Statistic.

Kappa Statistic compares the accuracy of the system to the accuracy of a random system.  To quote Richard Landis and Gary Koch from the 1977 paper The Measurement of Observer Agreement for Categorical Data, “..(total accuracy) is an observational probability of agreement and (random accuracy) is a hypothetical expected probability of agreement under an appropriate set of baseline constraints.”

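Total accuracy is simply the sum of true positives and true negatives, divided by the total number of items, that is:

$$\text{total accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

Random accuracy is defined as the sum of the products of reference likelihood and result likelihood for each class. That is,

$$\text{random accuracy} = \frac{\text{ActualFalse} \times \text{PredictedFalse} + \text{ActualTrue} \times \text{PredictedTrue}}{\text{Total}^2}$$

In terms of false positives etc., random accuracy can be written as:

$$\text{random accuracy} = \frac{(TN + FP)(TN + FN) + (FN + TP)(FP + TP)}{(TP + TN + FP + FN)^2}$$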
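The same definition extends to more than two classes. The following is a minimal sketch, assuming the confusion matrix is a square list of lists with rows as actual classes and columns as predicted classes. The function names are illustrative, and kappa is taken to be (total accuracy − random accuracy) / (1 − random accuracy).

```python
def total_accuracy(matrix):
    """Share of items on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[k][k] for k in range(len(matrix)))
    return correct / total


def random_accuracy(matrix):
    """Sum over classes of (actual share) * (predicted share)."""
    total = sum(sum(row) for row in matrix)
    expected = 0.0
    for k in range(len(matrix)):
        actual_k = sum(matrix[k])                    # row sum
        predicted_k = sum(row[k] for row in matrix)  # column sum
        expected += (actual_k / total) * (predicted_k / total)
    return expected


def kappa(matrix):
    """Kappa statistic; undefined when random accuracy equals 1."""
    observed = total_accuracy(matrix)
    expected = random_accuracy(matrix)
    return (observed - expected) / (1 - expected)


# Rows are actual classes, columns are predicted classes.
always_wrong = [[0, 50], [50, 0]]                 # systematic disagreement
perfect_three_class = [[5, 0, 0], [0, 3, 0], [0, 0, 2]]
print(kappa(always_wrong), kappa(perfect_three_class))
```

The first matrix illustrates that kappa can go negative when the observed agreement falls below chance, while the second shows the upper bound of one for perfect agreement.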

I have taken the previous test case confusion matrices and added the Kappa to that as well. Here is a snapshot.

Two things about the kappa statistic are of further interest:

Firstly, it is a general statistic that can be used for classification systems, not just for targeting systems. Secondly, the kappa statistic is a normalized statistic, just like MCC. Its value never exceeds one, so the same statistic can be used even as the number of observations grows.

Here is the link to the PDF if that is of interest.