Background: This post is about 2-class classification systems, that is, targeting systems. For a targeting system, it is common practice to discuss performance in terms of the confusion matrix, a 2×2 matrix that compares predicted T/F values against actual T/F values. The 4 cells of the confusion matrix are True Negative (Predicted = False, Actual = False), False Positive (Predicted = True, Actual = False), False Negative (Predicted = False, Actual = True) and True Positive (Predicted = True, Actual = True). An example confusion matrix is shown here:
|          | Predicted F | Predicted T |
|----------|-------------|-------------|
| Actual F | 800         | 100         |
| Actual T | 20          | 80          |

The targeting system (that this confusion matrix represents) has resulted in 800 true negative decisions, 80 true positive decisions, 100 false positive decisions, and 20 false negative decisions.
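To see how the four cells come out of raw decisions, here is a minimal Python sketch. The function name `confusion_counts` and the way the label lists are laid out are my own illustrative assumptions; the counts are the ones from the example above.

```python
from collections import Counter


def confusion_counts(actual, predicted):
    """Count the four confusion-matrix cells from paired boolean labels.

    Hypothetical helper for illustration: each (actual, predicted) pair
    falls into exactly one of the four cells.
    """
    cells = Counter(zip(actual, predicted))
    return {
        "TN": cells[(False, False)],
        "FP": cells[(False, True)],
        "FN": cells[(True, False)],
        "TP": cells[(True, True)],
    }


# Rebuild the example: 900 actual negatives followed by 100 actual positives.
actual = [False] * 900 + [True] * 100
predicted = [False] * 800 + [True] * 100 + [False] * 20 + [True] * 80

counts = confusion_counts(actual, predicted)
print(counts)  # {'TN': 800, 'FP': 100, 'FN': 20, 'TP': 80}
```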
So, as discussed in a previous post, the Matthews Correlation Coefficient (MCC) does pretty well at representing a confusion matrix (or, in other words, a targeting system or a model). Of course, MCC is not the only aggregate objective function (AOF) available for a confusion matrix. The F1 measure (the harmonic mean of recall and precision) is commonly used as well. A third AOF that I have frequently used (and tried to promote) is a reward/cost based function, which collapses the confusion matrix into a single value as a weighted linear function (where, obviously, TP and TN have positive weights and FP and FN have negative weights).
To make things tangible, let us consider a few example confusion matrices, and then you can decide how well each function really does. As the careful reader will note, there are infinitely many models (that is, infinitely many confusion matrices), so this is not an exercise to drive you to any conclusion. Rather, it is merely meant to enable us to compare these three AOFs in a limited sense.
As an example, we consider a sample of 1000 transactions, of which 100 are fraudulent. Suppose 10 different targeting systems are trying to find the fraudulent transactions, and these are the confusion matrices corresponding to these systems.

Comparison of 3 Aggregate Objective Functions for Confusion Matrix
In terms of definition, we note that MCC = (TP * TN - FP * FN) / sqrt((TP + FP) (TP + FN) (TN + FP) (TN + FN)).
Further, precision is defined as TP/(TP + FP), and recall is defined as TP/(TP + FN). The F1 measure is defined as 2*Recall*Precision/(Recall + Precision). It is a trivial exercise to observe that the F1 measure actually cares about what is “positive” versus what is “negative”: TN does not appear in it at all. For MCC, that distinction is merely semantic; in other words, you can switch the meaning of positive and negative, and the MCC value remains the same. That is not true for the F1 measure, and IMHO that is a drawback of the F1 measure. (A Turing machine that decides a language L is just as effective at deciding the complement of that language, that is L’, and should not receive a different grade for deciding L than it does for deciding L’.)
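The symmetry claim is easy to check numerically. The sketch below applies the definitions above to the example confusion matrix from the background section, then swaps the meaning of positive and negative. The function names are mine, and returning 0 for MCC when the denominator is zero is a convention I am assuming, not something defined in this post.

```python
from math import sqrt


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from the four cell counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: report 0.0 when any marginal total is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


def f1(tp, tn, fp, fn):
    """F1 measure; tn is accepted but unused, which is exactly the point.

    Assumes tp > 0 so that precision and recall are both defined.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * recall * precision / (recall + precision)


tp, tn, fp, fn = 80, 800, 100, 20

# Original labelling.
print(round(mcc(tp, tn, fp, fn), 3), round(f1(tp, tn, fp, fn), 3))  # 0.538 0.571

# Swap positive and negative: TP <-> TN and FP <-> FN.
print(round(mcc(tn, tp, fn, fp), 3), round(f1(tn, tp, fn, fp), 3))  # 0.538 0.93
```

MCC is unchanged by the swap, while F1 moves from about 0.57 to about 0.93 for the very same set of decisions.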
For the reward/cost based function, we have used the following values: R1 = Reward for TP = 10, R2 = Reward for TN = 0.1, C1 = Cost of FP = 0.1, C2 = Cost of FN = 10. How to select these values appropriately is the subject of a separate discussion.
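As a rough sketch of the reward/cost function with these weights: I am assuming here that the score is the plain, unnormalized weighted sum R1*TP + R2*TN - C1*FP - C2*FN. The function name `reward_cost` and the parameter names are hypothetical, and the input is the example confusion matrix from the background section.

```python
def reward_cost(tp, tn, fp, fn, r_tp=10.0, r_tn=0.1, c_fp=0.1, c_fn=10.0):
    """Weighted linear score: rewards for TP/TN minus costs for FP/FN.

    Assumption: an unnormalized sum; dividing by the sample size would
    rescale the score without changing how models rank against each other
    on the same sample.
    """
    return r_tp * tp + r_tn * tn - c_fp * fp - c_fn * fn


score = reward_cost(tp=80, tn=800, fp=100, fn=20)
print(round(score, 2))  # 670.0
```

With these weights, one false negative wipes out the reward of one true positive, while a false positive costs only 0.1, so the function leans strongly toward recall.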
As a highlight, consider Model 4 and Model 6. Model 6 has higher precision, while Model 4 has higher recall. MCC slightly favors Model 4. The F1 measure slightly favors Model 6. The cost/reward measure clearly favors Model 4.
So, as a question for you – if you had to select the model, which of these two models (Model 4 vs. Model 6) would you select?