Confusion Matrix – Another Single Value Metric – Kappa Statistic
Background: This is another in the line of posts on how to compare confusion matrices. The approach, as in earlier posts, is to use an aggregate objective function (or single value metric) that takes a confusion matrix and reduces it to one value.
In a previous post, we have discussed Matthews Correlation Coefficient, F1 measure, and reward/cost based single value metrics. Another single value metric (or aggregate objective function) that is worth discussing is the Kappa Statistic.
Kappa Statistic is interesting in the sense that it actually tries to compare the accuracy of the system to the accuracy of a random system. To quote Richard Landis and Gary Koch from the 1977 paper The Measurement of Observer Agreement for Categorical Data, “..(total accuracy) is an observational probability of agreement and (random accuracy) is a hypothetical expected probability of agreement under an appropriate set of baseline constraints.”
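Total accuracy is simply the sum of true positives and true negatives, divided by the total number of items, that is:
Total Accuracy = (TP + TN) / (TP + TN + FP + FN)
Random Accuracy is defined as the sum of the products of reference likelihood and result likelihood for each class. That is,
Random Accuracy = (ActualFalse * PredictedFalse + ActualTrue * PredictedTrue) / (Total * Total), where ActualFalse, PredictedFalse, etc. are the counts of items with that actual or predicted label, and Total = TP + TN + FP + FN.
In terms of false positives etc., random accuracy can be written as:
Random Accuracy = ((TN + FP) * (TN + FN) + (FN + TP) * (TP + FP)) / (Total * Total)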
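The kappa statistic combines the two quantities as (total accuracy - random accuracy) / (1 - random accuracy), which is the standard form of Cohen's kappa. Here is a minimal Python sketch for a 2×2 confusion matrix. The function name and the counts (800 TN, 80 TP, 100 FP, 20 FN) are illustrative assumptions, and the sketch does not guard against the degenerate case where random accuracy equals one.

```python
def kappa(tp, tn, fp, fn):
    """Kappa statistic for a 2x2 confusion matrix given as four counts."""
    total = tp + tn + fp + fn
    # Observed agreement: fraction of items classified correctly.
    total_accuracy = (tp + tn) / total
    # Expected agreement of a random system with the same marginals:
    # (actual False * predicted False + actual True * predicted True) / total^2
    random_accuracy = ((tn + fp) * (tn + fn) + (fn + tp) * (tp + fp)) / (total * total)
    return (total_accuracy - random_accuracy) / (1 - random_accuracy)


# Illustrative counts only, not results from the original post.
print(round(kappa(tp=80, tn=800, fp=100, fn=20), 3))  # 0.508
```

For these counts, total accuracy is 0.88 and random accuracy is 0.756, so kappa is 0.124 / 0.244, about 0.508.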
I have taken the previous test case confusion matrices and added the Kappa to that as well. Here is a snapshot.
Two things about kappa statistic that are of further interest:
Firstly, it is a general statistic that can be used for classification systems, not just for targeting systems. Secondly, the kappa statistic is a normalized statistic, just like MCC. Its value never exceeds one, so the same statistic can be used even as the number of observations grows.
Here is the link to the PDF if that is of interest.
Data Science DC – Naive Bayes and Logistic Regression
Attending the Data Science DC meetup, will be live blogging..
7:04 PM: First up, we have the introductions and sponsor messages, by Harlan Harris.
7:18 PM: Elena Zheleva, from Living Social, starts the actual presentation. Starts with two examples:
Example 1: Classification of mail as Spam or Not-Spam
Example 2: Classification of voter to republican/democrat
Talks about features and attributes, and kinds of attributes (continuous, discrete, nominal, etc.)
The basics of Naive Bayes: The idea of Naive Bayes is of course simple enough. We should like to find P(Y | X), where X are the inputs, and Y are the class labels. X is typically composed of many, many attributes, so this may be better written as: P (Y | X1, X2, .. Xn)
Directly finding this would require a very large training set (due to 2^n combinations on binary attributes X1, X2, .. Xn). So, using Bayes theorem, we can rewrite this as:
P (Y | X) = P(X | Y) P (Y) / P(X)
P (X | Y) can be written again as P(X1, X2, .. Xn | Y), and now using the assumption that these attributes are independent of each other given the class Y (hence the name “Naive”), we can write this as: P (X | Y) = P(X1, X2, .. Xn | Y) = P(X1 | Y) * P(X2 | Y) * … * P(Xn|Y)
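To make the factorization concrete, here is a small Python sketch of Naive Bayes on binary attributes. It is not code from the talk: the function names, the toy data, and the add-one (Laplace) smoothing are all assumptions made for illustration. P(X) is dropped because it is the same for every class and so does not change which class scores highest.

```python
import math
from collections import defaultdict


def train_naive_bayes(rows, labels, alpha=1.0):
    """Estimate P(Y) and P(Xi = 1 | Y) from rows of 0/1 attributes.

    alpha is an add-one style smoothing constant (an assumption of this
    sketch) that keeps estimated probabilities away from 0 and 1.
    """
    n_features = len(rows[0])
    class_counts = defaultdict(int)
    feature_counts = defaultdict(lambda: [0] * n_features)
    for row, label in zip(rows, labels):
        class_counts[label] += 1
        for i, value in enumerate(row):
            feature_counts[label][i] += value
    priors = {y: count / len(labels) for y, count in class_counts.items()}
    likelihoods = {
        y: [(feature_counts[y][i] + alpha) / (class_counts[y] + 2 * alpha)
            for i in range(n_features)]
        for y in class_counts
    }
    return priors, likelihoods


def predict(row, priors, likelihoods):
    """Return the class maximizing log P(Y) + sum_i log P(Xi | Y)."""
    scores = {}
    for y, prior in priors.items():
        log_score = math.log(prior)
        for value, p in zip(row, likelihoods[y]):
            log_score += math.log(p if value else 1.0 - p)
        scores[y] = log_score
    return max(scores, key=scores.get)


# Hypothetical toy data: attributes are [contains "offer", contains "meeting"].
rows = [[1, 0], [1, 0], [1, 1], [0, 1], [0, 1], [0, 0]]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]
priors, likelihoods = train_naive_bayes(rows, labels)
print(predict([1, 0], priors, likelihoods))  # spam
print(predict([0, 1], priors, likelihoods))  # ham
```

Working in log space avoids multiplying many small probabilities together, which matters once n is large.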
Next, she talks about the difference in approach between Naive Bayes and Logistic Regression. The paper by Andrew Ng and Michael Jordan (not that Michael Jordan, but a famous one nevertheless) is a helpful resource in that regard.
Question: How does NB work when the attributes are continuous, not binary.
Answer: If we can assume the distributions are Gaussian (Normal), then we can learn the parameters (sigma and mu). (The Wikipedia article section on Sex Classification contains an example.)
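As a rough illustration of that answer, the Python sketch below learns mu and sigma per class for a single continuous attribute and then scores a new value using the Gaussian density. The data, class names, and function names are hypothetical, and the sample standard deviation is one of several reasonable estimators.

```python
import math
from statistics import mean, stdev


def gaussian_log_pdf(x, mu, sigma):
    """Log of the normal density with mean mu and standard deviation sigma."""
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))


def classify(x, samples_by_class):
    """Pick the class maximizing log P(Y) + log P(x | Y) for one attribute."""
    total = sum(len(values) for values in samples_by_class.values())
    scores = {}
    for label, values in samples_by_class.items():
        mu, sigma = mean(values), stdev(values)  # the learned parameters
        prior = len(values) / total
        scores[label] = math.log(prior) + gaussian_log_pdf(x, mu, sigma)
    return max(scores, key=scores.get)


# Hypothetical training values for one continuous attribute.
samples = {"a": [1.0, 2.0, 3.0], "b": [7.0, 8.0, 9.0]}
print(classify(2.5, samples))  # a
print(classify(7.5, samples))  # b
```

With several continuous attributes, the per-attribute log densities would simply be added, following the same independence assumption as before.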

7:54 PM: On to Logistic Regression. Talks about the problem of overfitting, which can occur if there are few samples. That was covered under the title of “Regularization” in a separate meetup.
8:13 PM: Time for acknowledgements and list of software available.
After meetup notes: Weka has a good reference implementation of Naive Bayes. Here is a snapshot of one of the examples. (I modified the data file a little bit, so your results may be slightly different.)
PageRank, Android and iPhone – How to use naturally occurring capabilities to your advantage
The implementation of PageRank was a watershed moment in technology that showed us one thing – how naturally occurring data could be used to create an amazingly good score, and thus Crowdsourcing was born. Next up (many years later) was YouTube (a Google acquisition, not an in-house technology), which gelled pretty well with their crowdsourcing mentality. After that (and it took much longer to pull off) came Android, which allowed Google to slowly penetrate a very crowded (no pun intended) cellphone market, again using the power of crowdsourcing. In this, however, it had a clear precedent – Apple, although an amazingly unlikely proponent, had basically invented the app market concept for cellphones with the very successful iPhone App Store. Google was the one to realize that no matter how sophisticated your cellphone and OS are (BlackBerry), a crowd of spaghetti beats one strong rope, and further, that crowdsourcing can only be countered with crowdsourcing. That competition between iOS and Android still goes on, and each of them tries to make its product sticky by making people dependent not only on the device, but also on the myriad applications that the device supports.
Now, BlackBerry is supposed to be coming up with an even better OS, but I wonder how much difference it can make, considering that their OS was already the best one in terms of robustness and OS-level functionality. Where it got beaten was simply by millions of apps, and that is where it clearly has to play catch-up. How should it plan to compete with Android (or iOS), where millions of dedicated developers are writing interesting applications, and people are writing tutorials and books on how to create those interesting applications?
This idea of using naturally occurring data (or capabilities, if abstracted one level up) is not limited to the cellphone market. Examples abound in many other verticals. For example, one of the reasons that NX CCS is so successful in integrating logistics data is that it simply uses the data that already exists – bills of lading, shipping notices, tracking information etc. Similarly, the success of TripIt is largely attributable to the fact that they simply use the reservation confirmations that already existed before their product came about. This idea itself can be considered an important ingredient in product stickiness – how much of what the product needs to work already exists? If the answer to that is "not so much", then clearly the idea or the product will have a longer adoption cycle.
Matthews Correlation Coefficient – How Well Does It Do?
Background: This post talks about 2-class classification systems, that is, targeting systems. In a targeting system, it is common practice to discuss performance in terms of a confusion matrix, which is a 2×2 matrix of predicted T/F values compared to actual T/F values. The 4 cells of the confusion matrix can be represented as True Negative (Predicted = False, Actual = False), False Positive (Predicted = True, Actual = False), False Negative (Predicted = False, Actual = True) and True Positive (Predicted = True, Actual = True). An example confusion matrix is shown here:
The targeting system (that this confusion matrix represents) has resulted in 800 true negative decisions, 80 true positive decisions, 100 false positive decisions, and 20 false negative decisions.
So, as discussed in a previous post, the Matthews Correlation Coefficient (MCC) does pretty well at representing a confusion matrix (or, in other words, a targeting system or a model). Of course, MCC is not the only aggregate objective function (AOF) available for a confusion matrix. The F1 measure (harmonic mean of recall and precision) is commonly used as well. A third AOF that I have frequently used (and tried to promote) is a reward/cost based function, which reduces the confusion matrix to a single value as a weighted linear function (where, obviously, TP and TN have positive weights and FP and FN have negative weights).
To make things tangible, let us consider a few example confusion matrices, and then you decide how well it really does. As the careful reader will note, there are infinitely many models (that is, infinitely many confusion matrices), so this is not an exercise to drive you to any conclusion. Rather, it is merely meant to enable us to compare these three AOFs in a limited sense.
As an example, we consider a sample of 1000 transactions, of which 100 are fraudulent. Suppose 10 different targeting systems are trying to find the fraudulent transactions, and these are the confusion matrices corresponding to these systems.
In terms of definition, we note that MCC = (TP * TN – FP * FN)/sqrt((TP + FP) (TP + FN) (TN + FP) (TN + FN)).
Further, precision is defined as TP/(TP + FP), and recall is defined as TP/(TP + FN). F1 measure is defined as 2*Recall*Precision/(Recall+Precision). It is a trivial exercise to observe that F1 measure actually cares about what is “positive” versus what is “negative”. For MCC, that distinction is merely semantic; in other words, you can switch the meaning of positive and negative, and the MCC value remains the same. That is not true for F1 measure, and IMHO is a drawback of F1 measure. (A Turing machine that decides a language L is just as effective at deciding the complement of that language, that is L’, and should not receive a different grade for deciding L than it does for deciding L’.)
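The symmetry claim is easy to check numerically. The Python sketch below uses the standard MCC definition, whose denominator is sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN)), together with the F1 definition given above. The function names are illustrative, and the counts are those of the example confusion matrix from the background section (800 TN, 80 TP, 100 FP, 20 FN). Swapping the meaning of positive and negative exchanges TP with TN and FP with FN.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient for a 2x2 confusion matrix."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator


def f1(tp, fp, fn):
    """F1 measure: harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * recall * precision / (recall + precision)


tp, tn, fp, fn = 80, 800, 100, 20

# Original labelling, then the labelling with positive and negative swapped.
print(round(mcc(tp, tn, fp, fn), 3), round(mcc(tn, tp, fn, fp), 3))  # 0.538 0.538
print(round(f1(tp, fp, fn), 3), round(f1(tn, fn, fp), 3))            # 0.571 0.93
```

MCC is unchanged by the swap, while F1 moves from about 0.57 to about 0.93 for the very same system.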
For the reward/cost based function, we have used the following values: R1 = Reward for TP = 10, R2 = Reward for TN = 0.1, C1 = Cost of FP = 0.1, C2 = Cost of FN = 10. It is the subject of a separate discussion as to how to select these values appropriately.
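One plausible reading of this weighted linear function, sketched in Python, is shown below. The exact form used for the results in this post is not spelled out, so the unnormalized sum R1*TP + R2*TN - C1*FP - C2*FN is an assumption, as are the function and parameter names. The counts are again those of the example confusion matrix from the background section.

```python
def reward_cost_score(tp, tn, fp, fn, r_tp=10.0, r_tn=0.1, c_fp=0.1, c_fn=10.0):
    """Weighted linear score: rewards for correct decisions minus costs of errors.

    Assumed form (not confirmed by the post): R1*TP + R2*TN - C1*FP - C2*FN.
    """
    return r_tp * tp + r_tn * tn - c_fp * fp - c_fn * fn


# 10*80 + 0.1*800 - 0.1*100 - 10*20 = 800 + 80 - 10 - 200
print(round(reward_cost_score(tp=80, tn=800, fp=100, fn=20), 2))  # 670.0
```

With these weights a missed fraud (FN) costs as much as a caught fraud (TP) earns, so the score leans heavily toward recall.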
As a highlight, consider model 4 and model 6. Model 6 has higher precision, while model 4 has higher recall. MCC slightly favors Model 4. F1 measure slightly favors Model 6. Cost/reward measure clearly favors model 4.
So, as a question for you – if you had to select the model, which of these two models (Model 4 vs. Model 6) would you select?
