
April 22nd, 2013

## Comparing Kappa Statistic to the F1 measure

We looked at Kappa Statistic previously, and I have been evaluating some aspects of it again.

To remind ourselves, the kappa statistic is a measure of consistency among different raters, taking into account the agreement occurring by chance. The standard formula for the kappa statistic is: kappa = (total accuracy − random accuracy) / (1 − random accuracy).

Firstly, an observation that I omitted to make earlier: the value of the kappa statistic can indeed be negative. Total accuracy can be less than random accuracy, and as the CMAJ letter by Juurlink and Detsky points out, this may indicate genuine disagreement, or it may reflect a problem in the application of a diagnostic test.

Secondly, one thing to love about kappa is the following. Consider the case where one actual class is much more prevalent than the other. In such a case, a classification system that simply outputs the more prevalent class may have a high F1 measure (high precision and high recall), but will have a very low value of kappa. For example, suppose we are asked whether it will rain in Seattle, and consider the following confusion matrix:

|          | Predicted F | Predicted T |
|----------|-------------|-------------|
| Actual F | 1           | 99          |
| Actual T | 1           | 899         |

This is a null hypothesis model, in the sense that it almost always predicts the class to be “T”. For this confusion matrix, precision is 0.9008 and recall is 0.9989. The F1 measure is 0.9473 and can give the impression that this is a useful model. The kappa value, however, is very low, at 0.0157, and gives a clear enough warning about the validity of this model.

In other words, while it may be easier to predict rain in Seattle (or sunshine in Aruba), kappa statistic tries to take away the bias in the actual distribution, while the F1 measure may not.
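The numbers above are easy to reproduce. Here is a small sketch in Python (the function name and layout are mine; the counts are taken from the Seattle matrix):

```python
# F1 looks strong on the Seattle-rain example, while kappa does not.
# Counts from the confusion matrix above: TN=1, FP=99, FN=1, TP=899.
def f1_and_kappa(tp, tn, fp, fn):
    n = tp + tn + fp + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    total_acc = (tp + tn) / n
    # Random accuracy: product of marginal likelihoods, summed over classes.
    random_acc = ((tn + fp) * (tn + fn) + (fn + tp) * (fp + tp)) / n**2
    kappa = (total_acc - random_acc) / (1 - random_acc)
    return f1, kappa

f1, kappa = f1_and_kappa(tp=899, tn=1, fp=99, fn=1)
print(round(f1, 4), round(kappa, 4))  # 0.9473 0.0157
```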

January 30th, 2012

## Comparing two confusion matrices

Comparing two confusion matrices is a standard approach for comparing the respective targeting systems, but by no means is it the only one. As we will discuss in the coming days, you can also compare two score based targeting systems by comparing their lists. But for now, let us focus on comparing the targeting systems by comparing their respective confusion matrices.

The standard approach is to use a single value metric to reduce each matrix into one value, and then to compare the metric values.  In other words, to compare M1 and M2, we simply compare f(M1) and f(M2), where function f is the single value metric.

Here are some single value metrics that can be considered as candidates:

1. Kappa Statistic
2. F1 measure
3. Matthews Correlation Coefficient
4. Reward/Cost based
5. Sensitivity (Recall)
6. Specificity (True Negative Rate)

A related approach is to take the matrix difference of the two matrices and then apply a dot product (or scalar product) with a weight matrix, but it is easy to see that this reduces to using a reward/cost based metric.  [A.C − B.C = (A−B).C, etc.]
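The linearity argument can be checked in a couple of lines; the matrices and weights below are made up for the demonstration, flattened to (TN, TP, FP, FN) vectors:

```python
# Scoring two matrices with a weight vector C, then comparing, is the same
# as scoring the difference matrix directly: A.C - B.C = (A - B).C.
A = [800, 80, 100, 20]    # hypothetical matrix M1 as (TN, TP, FP, FN)
B = [750, 90, 150, 10]    # hypothetical matrix M2
C = [0.1, 10, -0.1, -10]  # illustrative reward/cost weights

dot = lambda x, y: sum(a * b for a, b in zip(x, y))
diff = [a - b for a, b in zip(A, B)]
assert abs((dot(A, C) - dot(B, C)) - dot(diff, C)) < 1e-9
```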

December 29th, 2011

## Confusion Matrix – Another Single Value Metric – Kappa Statistic

Background: This is another in the line of posts on how to compare confusion matrices. The path, as in the past, is to use some aggregate objective function (or single value metric) that takes a confusion matrix and reduces it to one value.

In previous posts, we discussed the Matthews Correlation Coefficient, the F1 measure, and reward/cost based single value metrics. Another single value metric (or aggregate objective function) worth discussing is the Kappa Statistic.

Kappa Statistic is interesting in the sense that it actually tries to compare the accuracy of the system to the accuracy of a random system.  To quote Richard Landis and Gary Koch from the 1977 paper The Measurement of Observer Agreement for Categorical Data, “..(total accuracy) is an observational probability of agreement and (random accuracy) is a hypothetical expected probability of agreement under an appropriate set of baseline constraints.”

Total accuracy is simply the sum of true positives and true negatives, divided by the total number of items, that is: total accuracy = (TP + TN) / (TP + TN + FP + FN).

Random accuracy is defined as the sum, over each class, of the product of the reference likelihood and the result likelihood. That is: random accuracy = P(actual T) × P(predicted T) + P(actual F) × P(predicted F).

In terms of false positives etc., random accuracy can be written as: random accuracy = ((TN + FP)(TN + FN) + (FN + TP)(FP + TP)) / (TP + TN + FP + FN)².
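The two forms of random accuracy are the same quantity, which is easy to verify numerically. A quick sketch (the counts here are made up for the demonstration):

```python
# Verify that the per-class probability form of random accuracy equals
# the count form, then compute kappa from the definitions above.
tp, tn, fp, fn = 80, 800, 100, 20
n = tp + tn + fp + fn

total_acc = (tp + tn) / n
# Per-class form: reference likelihood times result likelihood, summed.
random_acc = ((tn + fp) / n) * ((tn + fn) / n) + ((fn + tp) / n) * ((fp + tp) / n)
# Count form: the same sum with a single division by n squared.
random_acc_counts = ((tn + fp) * (tn + fn) + (fn + tp) * (fp + tp)) / n**2
assert abs(random_acc - random_acc_counts) < 1e-12

kappa = (total_acc - random_acc) / (1 - random_acc)
print(round(kappa, 4))  # 0.5082
```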

I have taken the previous test case confusion matrices and added kappa to those as well. Here is a snapshot.

Two things about kappa statistic that are of further interest:

Firstly, it is a general statistic that can be used for classification systems, not just for targeting systems. Secondly, the kappa statistic is a normalized statistic, just like MCC. Its value never exceeds one, so the same statistic can be used even as the number of observations grows.

Here is the link to the PDF if that is of interest.

December 13th, 2011

## Matthews Correlation Coefficient – How Well Does It Do?

Background: This post talks about 2-class classification systems, that is, targeting systems. For a targeting system, it is common practice to discuss performance in terms of a confusion matrix, a 2×2 matrix comparing predicted T/F values against actual T/F values. The 4 cells of the confusion matrix can be represented as True Negative (Predicted = False, Actual = False), False Positive (Predicted = True, Actual = False), False Negative (Predicted = False, Actual = True) and True Positive (Predicted = True, Actual = True).  An example confusion matrix is shown here:

|          | Predicted F | Predicted T |
|----------|-------------|-------------|
| Actual F | 800         | 100         |
| Actual T | 20          | 80          |

The targeting system (that this confusion matrix represents) has resulted in 800 true negative decisions, 80 true positive decisions, 100 false positive decisions, and 20 false negative decisions.

So, as discussed in a previous post, the Matthews Correlation Coefficient (MCC) does pretty well at representing a confusion matrix (or, in other words, a targeting system or a model).  Of course, MCC is not the only aggregate objective function (AOF) available for a confusion matrix.  The F1 measure (harmonic mean of recall and precision) is commonly used as well.  A third AOF that I have frequently used (and tried to promote) is a reward/cost based function, which reduces the confusion matrix to a single value as a weighted linear function (where, obviously, TP and TN have positive weights and FP and FN have negative weights).

To make things tangible, let us consider a few example confusion matrices, and then you decide how it really does.  As a careful reader will note, there are infinitely many models (that is, infinitely many confusion matrices), so this is not an exercise to drive you to any conclusion.  Rather, it is merely meant to enable us to compare these three AOFs in a limited sense.

As an example, we consider a sample of 1000 transactions, of which 100 are fraudulent.  Suppose 10 different targeting systems are trying to find the fraudulent transactions, and these are the confusion matrices corresponding to these systems.

Comparison of 3 Aggregate Objective Functions for Confusion Matrix

In terms of definition, we note that MCC = (TP × TN – FP × FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN)).

Further, precision is defined as TP/(TP + FP), and recall as TP/(TP + FN).  The F1 measure is defined as 2 × Recall × Precision / (Recall + Precision).  It is a trivial exercise to observe that the F1 measure actually cares about what is “positive” versus what is “negative”.  For MCC, that distinction is merely semantic; in other words, you can switch the meaning of positive and negative, and the MCC value remains the same.  That is not true for the F1 measure, and IMHO this is a drawback of the F1 measure.  (A Turing machine that decides a language L is just as effective at deciding the complement of that language, L′, and should not receive a different grade for deciding L than it does for deciding L′.)
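The symmetry claim is easy to demonstrate. A sketch in Python, using the example confusion matrix from the top of this post (TN=800, FP=100, FN=20, TP=80); the function names are mine:

```python
import math

# MCC is invariant under swapping the meaning of positive and negative;
# F1 is not. Relabeling classes means TP <-> TN and FP <-> FN.
def mcc(tp, tn, fp, fn):
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom

def f1(tp, tn, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

tp, tn, fp, fn = 80, 800, 100, 20
assert abs(mcc(tp, tn, fp, fn) - mcc(tn, tp, fn, fp)) < 1e-12
print(round(f1(tp, tn, fp, fn), 4), round(f1(tn, tp, fn, fp), 4))  # 0.5714 0.9302
```

The same model scores 0.5714 or 0.9302 on F1 depending purely on which class we call “positive”, while its MCC is unchanged.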

For the reward/cost based function, we have used the following values: R1 = Reward for TP = 10, R2 = Reward for TN = 0.1, C1 = Cost of FP = 0.1, C2 = Cost of FN = 10.  It is the subject of a separate discussion as to how to select these values appropriately.
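As a sketch, the reward/cost AOF with the weights above is just a weighted linear function of the four cells; the confusion matrix passed in here is the example from the top of the post, not one of the ten models:

```python
# Reward/cost AOF with R1=10 (TP), R2=0.1 (TN), C1=0.1 (FP), C2=10 (FN).
def reward_cost(tp, tn, fp, fn, r1=10, r2=0.1, c1=0.1, c2=10):
    return r1 * tp + r2 * tn - c1 * fp - c2 * fn

print(reward_cost(tp=80, tn=800, fp=100, fn=20))  # 670.0
```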

As a highlight, consider model 4 and model 6.  Model 6 has higher precision, while model 4 has higher recall.  MCC slightly favors Model 4.  F1 measure slightly favors Model 6.  Cost/reward measure clearly favors model 4.

So, as a question for you – if you had to select the model, which of these two models (Model 4 vs. Model 6) would you select?

June 29th, 2010

## Matthews correlation coefficient

Like many other researchers, I have struggled with the holy grail of representing the confusion matrix with a single value.  Surely, it may be easy to compare two confusion matrices; for example, you can say that confusion matrix 2 is better than confusion matrix 1, below.

These two confusion matrices are trivially comparable.  Confusion matrix 2 is better than confusion matrix 1, implying that the targeting system underlying confusion matrix 2 is better than the one underlying confusion matrix 1.

More confusion (pun intended) arises when two confusion matrices are not trivially comparable.  What are we to do in that case?  Firstly, let us give these a name and a definition.

Definition 1: Two confusion matrices C1 and C2 are trivially comparable if and only if :

(FP(C1) <= FP(C2) and FN(C1) <= FN(C2)) or (FP(C2) <= FP(C1) and FN(C2) <= FN(C1)).

The matrix with the lower number of false positives and false negatives can then be called the safely better confusion matrix.
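Definition 1 transcribes directly into code. A sketch, representing each matrix as a (TP, TN, FP, FN) tuple (the representation and function name are my choices, the example counts are made up):

```python
# Definition 1: C1 and C2 are trivially comparable iff one of them has
# no more false positives AND no more false negatives than the other.
def trivially_comparable(c1, c2):
    _, _, fp1, fn1 = c1
    _, _, fp2, fn2 = c2
    return (fp1 <= fp2 and fn1 <= fn2) or (fp2 <= fp1 and fn2 <= fn1)

baseline = (80, 800, 100, 20)
print(trivially_comparable((90, 850, 50, 10), baseline))  # True: fewer FP and FN
print(trivially_comparable((90, 800, 40, 30), baseline))  # False: FP lower, FN higher
```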

So, back to the discussion of what can we do if two confusion matrices are not trivially comparable?  How can we compare them then?

The F measure, the harmonic mean of recall and precision, is a good example of such a measure, but it is woefully inadequate in the specific vertical that I operate in.  I have had many other “home-grown” measures, which are home-grown for a reason: they haven’t had full academic review yet.  More recently I have come across the Matthews correlation coefficient, and am sometimes amazed at how nicely it represents confusion matrices.

No single metric works in all situations, but this one comes pretty darn close.
