Confusion Matrix – Another Single Value Metric – Kappa Statistic
Background: This is another in the line of posts on how to compare confusion matrices. The approach, as in past posts, is to use some aggregate objective function (or single value metric) that takes a confusion matrix and reduces it to one value.
In a previous post, we have discussed Matthews Correlation Coefficient, F1 measure, and reward/cost based single value metrics. Another single value metric (or aggregate objective function) that is worth discussing is the Kappa Statistic.
Kappa Statistic is interesting in the sense that it actually tries to compare the accuracy of the system to the accuracy of a random system. To quote Richard Landis and Gary Koch from the 1977 paper The Measurement of Observer Agreement for Categorical Data, “..(total accuracy) is an observational probability of agreement and (random accuracy) is a hypothetical expected probability of agreement under an appropriate set of baseline constraints.”
Total accuracy is simply the sum of true positives and true negatives, divided by the total number of items, that is:
Total Accuracy = (TP + TN)/(TP + TN + FP + FN)
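Random Accuracy is defined as the sum of the products of reference likelihood and result likelihood for each class. That is,
Random Accuracy = (ActualFalse * PredictedFalse + ActualTrue * PredictedTrue)/(Total * Total), where Total is the total number of items.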
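In terms of false positives etc., random accuracy can be written as:
Random Accuracy = ((TN + FP) * (TN + FN) + (FN + TP) * (FP + TP))/(Total * Total), where Total = TP + TN + FP + FN.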
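As a minimal sketch, the two accuracies can be combined into kappa in a few lines of Python. This assumes the usual definition of the statistic, Kappa = (Total Accuracy – Random Accuracy)/(1 – Random Accuracy); the function name and the example counts are illustrative only.

```python
def kappa_statistic(tp, tn, fp, fn):
    """Kappa for a 2x2 confusion matrix (assumed standard definition)."""
    total = tp + tn + fp + fn
    # Observed agreement: fraction of items on the diagonal.
    total_accuracy = (tp + tn) / total
    # Expected agreement of a random system with the same marginals:
    # (actual false * predicted false + actual true * predicted true) / total^2
    random_accuracy = ((tn + fp) * (tn + fn) + (fn + tp) * (fp + tp)) / (total * total)
    # Note: undefined (division by zero) when random_accuracy is exactly 1.
    return (total_accuracy - random_accuracy) / (1 - random_accuracy)


# Illustrative counts: 800 TN, 80 TP, 100 FP, 20 FN.
kappa = kappa_statistic(tp=80, tn=800, fp=100, fn=20)
print(round(kappa, 4))
```

For these counts, total accuracy is 0.88 and random accuracy is 0.756, so kappa comes out at about 0.508: the system closes roughly half of the gap between a random system and a perfect one.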
I have taken the previous test case confusion matrices and added the Kappa to that as well. Here is a snapshot.
Two things about the kappa statistic that are of further interest:
Firstly, it is a general statistic that can be used for classification systems, not just for targeting systems. Secondly, the kappa statistic is a normalized statistic, just like MCC. Its value never exceeds one, so the same statistic can be used even as the number of observations grows.
Here is the link to the PDF if that is of interest.
Data Science DC – Naive Bayes and Logistic Regression
Attending the Data Science DC meetup, will be live blogging..
7:04 PM: First up, we have the introductions and sponsor messages, by Harlan Harris.
7:18 PM: Elena Zheleva, from Living Social, starts the actual presentation. Starts with two examples:
Example 1: Classification of mail as Spam or Not-Spam
Example 2: Classification of voters as Republican/Democrat
Talks about features and attributes, and kinds of attributes (continuous, discrete, nominal, etc.)
The basics of Naive Bayes: The idea of Naive Bayes is of course simple enough. We should like to find P(Y | X), where X are the inputs, and Y are the class labels. X is typically composed of many, many attributes, so this may be better written as: P (Y | X1, X2, .. Xn)
Directly finding this would require a very large training set (due to the 2^n combinations of binary attributes X1, X2, .. Xn). So, using Bayes' theorem, we can rewrite this as:
P (Y | X) = P(X | Y) P (Y) / P(X)
P (X | Y) can be written again as P(X1, X2, .. Xn | Y), and now using the assumption that these attributes are conditionally independent given the class Y (hence the name “Naive”), we can write this as:
P (X | Y) = P(X1, X2, .. Xn | Y) = P(X1 | Y) * P(X2 | Y) * … * P(Xn | Y)
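To make the factorization concrete, here is a minimal from-scratch sketch for binary attributes. It is not from the talk: the toy data, the function names, and the add-one (Laplace) smoothing are all assumptions made for illustration. Since P(X) is the same for every class, it is dropped and only P(Y) * P(X1 | Y) * … * P(Xn | Y) is compared, in log space.

```python
import math
from collections import defaultdict


def train_naive_bayes(rows, labels):
    """Estimate P(Y) and P(Xi = 1 | Y) from rows of 0/1 attributes."""
    n_features = len(rows[0])
    class_counts = defaultdict(int)
    feature_counts = defaultdict(lambda: [0] * n_features)
    for row, label in zip(rows, labels):
        class_counts[label] += 1
        for i, value in enumerate(row):
            feature_counts[label][i] += value
    priors = {y: class_counts[y] / len(labels) for y in class_counts}
    # Add-one smoothing (an assumption) keeps every probability away from 0 and 1.
    likelihoods = {
        y: [(feature_counts[y][i] + 1) / (class_counts[y] + 2) for i in range(n_features)]
        for y in class_counts
    }
    return priors, likelihoods


def predict(priors, likelihoods, row):
    """Return the class maximizing log P(Y) + sum of log P(Xi | Y)."""
    scores = {}
    for y in priors:
        log_score = math.log(priors[y])
        for p, value in zip(likelihoods[y], row):
            log_score += math.log(p if value else 1 - p)
        scores[y] = log_score
    return max(scores, key=scores.get)


# Hypothetical attributes: [contains "free", contains "meeting"]
rows = [[1, 0], [1, 0], [1, 1], [0, 1], [0, 1], [0, 0]]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]
priors, likelihoods = train_naive_bayes(rows, labels)
print(predict(priors, likelihoods, [1, 0]))
print(predict(priors, likelihoods, [0, 1]))
```

Only n per-class probabilities are estimated here, rather than one probability for each of the 2^n attribute combinations, which is what makes the approach workable with a small training set.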
Next, she talks about the difference in approach between Naive Bayes and Logistic Regression. The paper by Andrew Ng and Michael Jordan (not that Michael Jordan, but a famous one nevertheless) is a helpful resource in that regard.
Question: How does NB work when the attributes are continuous, not binary?
Answer: If we can assume the distributions are Gaussian (Normal), then we can learn the parameters (sigma and mu). (The Wikipedia article section on Sex Classification contains an example.)
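A rough sketch of that answer for a single continuous attribute follows. The data values, labels, and function names are hypothetical; the only idea taken from the talk is that mu and sigma are learned per class and plugged into the Gaussian density in place of a discrete P(Xi | Y).

```python
import math
from statistics import mean, stdev


def train_gaussian_nb(samples):
    """samples maps each class label to a list of values of one attribute."""
    total = sum(len(values) for values in samples.values())
    # Per class: prior P(Y), mean mu, and standard deviation sigma.
    return {
        label: (len(values) / total, mean(values), stdev(values))
        for label, values in samples.items()
    }


def log_gaussian(x, mu, sigma):
    """Log of the normal density, used in place of a discrete P(Xi | Y)."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)


def classify(model, x):
    """Pick the class with the largest log P(Y) + log p(x | Y)."""
    return max(
        model,
        key=lambda label: math.log(model[label][0]) + log_gaussian(x, *model[label][1:]),
    )


# Hypothetical measurements for two classes.
model = train_gaussian_nb({
    "A": [5.9, 6.0, 5.8, 6.1],
    "B": [5.0, 5.2, 5.4, 5.1],
})
print(classify(model, 6.0))
print(classify(model, 5.1))
```

With several continuous attributes, the same per-attribute log densities would simply be summed, mirroring the product in the binary case.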

7:54 PM: On to Logistic Regression. Talks about the problem of overfitting, which can occur if there are few samples. That was covered under the title of “Regularization” in a separate meetup.
8:13 PM: Time for acknowledgements and list of software available.
After meetup notes: Weka has a good reference implementation of Naive Bayes. Here is a snapshot of one of the examples. (I modified the data file a little bit, so your results may be slightly different.)
Matthews Correlation Coefficient – How Well Does It Do?
Background: This post talks about 2-class classification systems, that is, targeting systems. In a targeting system, it is common practice to discuss the performance in terms of a confusion matrix, which is a 2×2 matrix consisting of predicted T/F values compared to actual T/F values. The 4 cells of the confusion matrix can be represented as True Negative (Predicted = False, Actual = False), False Positive (Predicted = True, Actual = False), False Negative (Predicted = False, Actual = True) and True Positive (Predicted = True, Actual = True). An example confusion matrix is shown here:
The targeting system (that this confusion matrix represents) has resulted in 800 true negative decisions, 80 true positive decisions, 100 false positive decisions, and 20 false negative decisions.
So, as discussed in a previous post, Matthews Correlation Coefficient (MCC) does pretty well to represent a confusion matrix (or, in other words, a targeting system or a model). Of course, MCC is not the only aggregate objective function (AOF) available for a confusion matrix. The F1 measure (harmonic mean of recall and precision) is commonly used as well. A third AOF that I have frequently used (and tried to promote) is a reward/cost based function, which reduces the confusion matrix to a single value as a weighted linear function (where, obviously, TP and TN have positive weights and FP and FN have negative weights).
To make things tangible, let us consider a few example confusion matrices, and then you decide how well it really does. As the careful reader will note, there are infinitely many models (that is, infinitely many confusion matrices), so this is not an exercise to drive you to any conclusion. Rather, this is merely meant to enable us to compare these three AOFs in a limited sense.
As an example, we consider a sample of 1000 transactions, of which 100 are fraudulent. Suppose 10 different targeting systems are trying to find the fraudulent transactions, and these are the confusion matrices corresponding to these systems.
In terms of definition, we note that MCC = (TP * TN – FP * FN)/sqrt((TP + FP) (TP + FN) (TN + FP) (TN + FN)).
Further, precision is defined as TP/(TP + FP), and recall is defined as TP/(TP + FN). The F1 measure is defined as 2*Recall*Precision/(Recall+Precision). It is a trivial exercise to observe that the F1 measure actually cares about what is “positive” versus what is “negative”: TN does not appear anywhere in its definition. For MCC, that distinction is merely semantic; in other words, you can switch the meaning of positive and negative, and the MCC value remains the same. That is not true for the F1 measure, and IMHO is a drawback of the F1 measure. (A Turing machine that decides a language L is just as effective at deciding the complement of that language, that is L’, and should not receive a different grade for deciding L than it does for deciding L’.)
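The symmetry argument is easy to check numerically. The sketch below uses the standard MCC denominator sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN)) and the background example matrix (800 TN, 80 TP, 100 FP, 20 FN); the function names are illustrative. Switching the meaning of positive and negative amounts to swapping TP with TN and FP with FN.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient for a 2x2 confusion matrix."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator


def f1_measure(tp, tn, fp, fn):
    """Harmonic mean of precision and recall; tn is accepted but never used."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * recall * precision / (recall + precision)


original = dict(tp=80, tn=800, fp=100, fn=20)
# Relabel the classes: old negatives become the new positives.
swapped = dict(tp=800, tn=80, fp=20, fn=100)

mcc_original, mcc_swapped = mcc(**original), mcc(**swapped)
f1_original, f1_swapped = f1_measure(**original), f1_measure(**swapped)

print(round(mcc_original, 4), round(mcc_swapped, 4))
print(round(f1_original, 4), round(f1_swapped, 4))
```

MCC is about 0.538 under both labelings, while F1 moves from about 0.571 to about 0.930 simply because the larger class is now called “positive”.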
For the reward/cost based function, we have used the following values: R1 = Reward for TP = 10, R2 = Reward for TN = 0.1, C1 = Cost of FP = 0.1, C2 = Cost of FN = 10. It is the subject of a separate discussion as to how to select these values appropriately.
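As a sketch, and assuming the weighted linear function takes the plain form R1*TP + R2*TN – C1*FP – C2*FN (the exact form is an assumption here), the computation looks like this for the background example matrix. The function and parameter names are illustrative.

```python
def reward_cost_score(tp, tn, fp, fn,
                      reward_tp=10.0, reward_tn=0.1,
                      cost_fp=0.1, cost_fn=10.0):
    """Weighted linear AOF: rewards for correct decisions minus costs for errors."""
    return reward_tp * tp + reward_tn * tn - cost_fp * fp - cost_fn * fn


# Background example: 800 TN, 80 TP, 100 FP, 20 FN.
score = reward_cost_score(tp=80, tn=800, fp=100, fn=20)
print(score)  # 800 + 80 - 10 - 200, i.e. about 670
```

With these weights a single false negative cancels out the reward of one true positive, while a false positive costs a hundredth as much, so the function leans strongly toward recall.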
As a highlight, consider model 4 and model 6. Model 6 has higher precision, while model 4 has higher recall. MCC slightly favors Model 4. F1 measure slightly favors Model 6. Cost/reward measure clearly favors model 4.
So, as a question for you – if you had to select the model, which of these two models (Model 4 vs. Model 6) would you select?
Big Data DC Meetup #4 – Kafka
Live Blogging from the Big Data DC Meetup #4 – Chris Burroughs is presenting on Kafka.
Essentially, Kafka is a distributed publish-subscribe messaging system. It provides persistent messaging (that is, protection against restarts and shutdowns). It also uses O(1) disk structures that provide constant time performance even with many TB of stored messages. (This is sort of impressive as it is already.)
The main advantage of Kafka seems to be the high-throughput: even using simple hardware Kafka can support hundreds of thousands of messages per second.
Leading Financial Institution with Nice GUI and No Transaction Analysis
Here is my full Bank of America screenshot. Let everyone be aware that I use BoA, I am a Novec customer, and how much I spend each month on electricity (although, although, there is a twist to that).
As you can notice, using a spiffy UI, I can easily find transactions using a keyword (Novec). That is cool as ice. I like that. (In Jim Carrey voice: I like that a lot.)
However, do notice two payments going out for each bill that I receive. One might think that using some advanced (or rudimentary) transaction analysis, the bank might find that to be rather odd. (Might is the operative word in the previous sentence, used twice for effect.) Anomaly detection techniques have been around for a while. Still, this goes on.
Ok, now back to my tease from the beginning – my electricity spending each month has not been disclosed by these numbers. Why? Well, the way it works is that although the payment is being made twice, the receiver, Novec, does credit it correctly, therefore bringing some credit balance forward each month. (If you like a Math problem, you can try to solve for the actual numbers.)
Still, the lack of simple transaction analysis by a leading financial institution is rather surprising.


