
Matthews Correlation Coefficient – How Well Does It Do?

by Amrinder
Background: This post talks about 2-class classification systems, that is, targeting systems. In a targeting system, it is common practice to discuss performance in terms of a confusion matrix, which is a 2×2 matrix of predicted T/F values compared to actual T/F values. The 4 cells of the confusion matrix are True Negative (Predicted = False, Actual = False), False Positive (Predicted = True, Actual = False), False Negative (Predicted = False, Actual = True) and True Positive (Predicted = True, Actual = True). An example confusion matrix is shown here:

Actual \ Predicted    F      T
F                     800    100
T                     20     80

The targeting system (that this confusion matrix represents) has resulted in 800 true negative decisions, 80 true positive decisions, 100 false positive decisions, and 20 false negative decisions.
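As a minimal sketch of how these four cells are counted, the Python snippet below tallies them from two parallel lists of actual and predicted labels. The function name `confusion_counts` and the synthetic label lists are illustrative assumptions, built only to reproduce the example matrix above.

```python
from collections import Counter

def confusion_counts(actual, predicted):
    """Tally (TN, FP, FN, TP) from parallel sequences of booleans."""
    tally = Counter(zip(actual, predicted))  # keys are (actual, predicted) pairs
    tn = tally[(False, False)]
    fp = tally[(False, True)]
    fn = tally[(True, False)]
    tp = tally[(True, True)]
    return tn, fp, fn, tp

# Synthetic labels (an assumption) that reproduce the example matrix:
# 900 actual negatives (800 predicted F, 100 predicted T),
# 100 actual positives (20 predicted F, 80 predicted T).
actual = [False] * 900 + [True] * 100
predicted = [False] * 800 + [True] * 100 + [False] * 20 + [True] * 80

print(confusion_counts(actual, predicted))  # (800, 100, 20, 80)
```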

So, as discussed in a previous post, Matthews Correlation Coefficient (MCC) does pretty well at representing a confusion matrix (or, in other words, a targeting system or a model). Of course, MCC is not the only aggregate objective function (AOF) available for a confusion matrix. The F1 measure (harmonic mean of recall and precision) is commonly used as well. A third AOF that I have frequently used (and tried to promote) is a reward/cost-based function, which condenses the confusion matrix into a single value as a weighted linear function (where, obviously, TP and TN have positive weights and FP and FN have negative weights).

To make things tangible, let us consider a few example confusion matrices, and then you can decide how well each AOF really does. As a careful reader will note, there are infinitely many models (that is, infinitely many confusion matrices), so this is not an exercise to drive you to any conclusion. Rather, it is merely meant to let us compare these three AOFs in a limited sense.

As an example, we consider a sample of 1000 transactions, of which 100 are fraudulent.  Suppose 10 different targeting systems are trying to find the fraudulent transactions, and these are the confusion matrices corresponding to these systems.

Comparison of 3 Aggregate Objective Functions for Confusion Matrix

In terms of definition, we note that MCC = (TP * TN – FP * FN) / sqrt((TP + FP) (TP + FN) (TN + FP) (TN + FN)). Each factor in the denominator is a row or column total of the confusion matrix.
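For a concrete feel, here is a small Python sketch of MCC using the standard denominator, the product of the four marginal totals (TP+FP)(TP+FN)(TN+FP)(TN+FN), applied to the example matrix from the background section. The function name `mcc` and the choice to return 0.0 when a marginal total is zero are assumptions made for illustration.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient for a 2x2 confusion matrix."""
    numerator = tp * tn - fp * fn
    # Product of the two predicted totals and the two actual totals.
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumption: treat an empty row or column as "no correlation".
    return numerator / denominator if denominator else 0.0

# Example matrix from the background section: TN=800, FP=100, FN=20, TP=80.
print(round(mcc(tp=80, tn=800, fp=100, fn=20), 3))  # 0.538
```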

Further, precision is defined as TP/(TP + FP) and recall is defined as TP/(TP + FN). The F1 measure is defined as 2*Recall*Precision/(Recall + Precision). It is easy to see that the F1 measure cares about which class is called “positive” and which “negative”: TN does not appear in it at all. For MCC, that distinction is merely semantic; in other words, you can switch the meaning of positive and negative, and the MCC value remains the same. That is not true for the F1 measure, and IMHO it is a drawback of the F1 measure. (A Turing machine that decides a language L is just as effective at deciding the complement of that language, L’, and should not receive a different grade for deciding L than it does for deciding L’.)
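The label-swap argument can be checked numerically. The sketch below (function names `f1` and `mcc` are illustrative) evaluates both measures on the example matrix from the background section and again after swapping the roles of positive and negative, which exchanges TP with TN and FP with FN. It uses the standard MCC denominator (TP+FP)(TP+FN)(TN+FP)(TN+FN).

```python
import math

def f1(tp, tn, fp, fn):
    """F1 measure; tn is accepted but unused, which is exactly the point."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient (standard definition)."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator

original = dict(tp=80, tn=800, fp=100, fn=20)
# Swapping the meaning of positive and negative: TP<->TN and FP<->FN.
swapped = dict(tp=800, tn=80, fp=20, fn=100)

print(round(f1(**original), 3), round(f1(**swapped), 3))    # 0.571 0.93
print(round(mcc(**original), 3), round(mcc(**swapped), 3))  # 0.538 0.538
```

On this one matrix, F1 moves from about 0.57 to about 0.93 under the swap, while MCC is unchanged.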

For the reward/cost-based function, we have used the following values: R1 = Reward for TP = 10, R2 = Reward for TN = 0.1, C1 = Cost of FP = 0.1, C2 = Cost of FN = 10. How to select these values appropriately is the subject of a separate discussion.
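A minimal sketch of such a weighted linear function, using the four values quoted above, might look like the following. The function name `reward_cost` is illustrative, and the plain unnormalized sum (rewards added, costs subtracted) is an assumption; the post does not say whether the result is scaled in any way.

```python
def reward_cost(tp, tn, fp, fn, r_tp=10.0, r_tn=0.1, c_fp=0.1, c_fn=10.0):
    """Weighted linear score: rewards for correct cells minus costs for errors.

    Defaults are the R1, R2, C1, C2 values from the text.
    Assumption: the score is a raw sum, not normalized by sample size.
    """
    return r_tp * tp + r_tn * tn - c_fp * fp - c_fn * fn

# Example matrix from the background section: TN=800, FP=100, FN=20, TP=80.
# 10*80 + 0.1*800 - 0.1*100 - 10*20 = 800 + 80 - 10 - 200
print(round(reward_cost(tp=80, tn=800, fp=100, fn=20), 2))  # 670.0
```

With these weights a false negative costs 100 times as much as a false positive, so the score leans heavily toward models with high recall.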

As a highlight, consider Model 4 and Model 6. Model 6 has higher precision, while Model 4 has higher recall. MCC slightly favors Model 4. The F1 measure slightly favors Model 6. The cost/reward measure clearly favors Model 4.

So, as a question for you – if you had to select the model, which of these two models (Model 4 vs. Model 6) would you select?



2 Comments to “Matthews Correlation Coefficient – How Well Does It Do?”

  1. To select a model (4 or 6 or any other) we need to be clear on objectives. For example, if false negatives pose a serious safety issue, then Recall (also called Sensitivity) is very important, so we would favor Model 4. If we care about correctly identifying negatives (for example, not predicting anyone healthy as sick), then another measure, Specificity, is important. In that case we would favor Model 6 (by a little bit: 0.944 vs. 0.917). If we are unclear on objectives (which we mostly are :) ), then Model 4 is better: it gains a lot of Sensitivity by losing a bit of Specificity and other measures. Thus, there is no secret formula that will solve the problem; we just need to be clear on what we want.

  2. I’m looking at ways to evaluate the results of entity extraction engines against document sets, but I’m worried about the potential scale of the value of TN in the MCC. The other three values are pretty straightforward (you count what the engine finds and tags, or misses), even for a fairly complex concept-based model. Do I assume that the value of TN for a document that is classified correctly is 1, which could under-represent TN, or would it be the total number of words in the document, which could over-represent TN in relation to the other values? Any suggestions?

