
April 22nd, 2013

Comparing Kappa Statistic to the F1 measure

We looked at Kappa Statistic previously, and I have been evaluating some aspects of it again.

To remind ourselves, the kappa statistic is a measure of consistency among different raters, taking into account the agreement occurring by chance. The standard formula for the kappa statistic is:

Kappa = (Total Accuracy - Random Accuracy) / (1 - Random Accuracy)

Firstly, an observation that I omitted to make earlier: the value of the kappa statistic can indeed be negative. Total accuracy can be less than random accuracy, and as a CMAJ letter by Juurlink and Detsky points out, this may indicate genuine disagreement, or it may reflect a problem in the application of a diagnostic test.

Secondly, one thing to love about Kappa is the following. Consider the case where one actual class is much more prevalent than the other. In such a case, a classification system that simply outputs the more prevalent class may have a high F1 measure (a high precision and high recall), but will have a very low value of kappa. For example, suppose we are asked whether it will rain in Seattle, and consider the following confusion matrix:

                 Predicted
                  F      T
    Actual  F     1     99
            T     1    899

This is a null hypothesis model, in the sense that it almost always predicts the class to be “T”. For this confusion matrix, precision is 0.9008 and recall is 0.9989. The F1 measure is 0.9473, which can give the impression that this is a useful model. The kappa value, however, is very low, at 0.0157, and gives a clear enough warning about the validity of this model.
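As a quick check, these numbers can be reproduced in a few lines of Python (a sketch; the TN/FP/FN/TP variable naming is mine, not from the post):

```python
# Metrics for the "rain in Seattle" confusion matrix above.
# Rows are actual (F, T), columns are predicted (F, T):
#   actual F: TN = 1,  FP = 99
#   actual T: FN = 1,  TP = 899
tn, fp = 1, 99
fn, tp = 1, 899
total = tn + fp + fn + tp

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

total_accuracy = (tp + tn) / total
# Chance agreement computed from the row (actual) and column (predicted) totals.
random_accuracy = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / total ** 2
kappa = (total_accuracy - random_accuracy) / (1 - random_accuracy)

print(round(f1, 4), round(kappa, 4))  # 0.9473 0.0157
```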

In other words, while it may be easier to predict rain in Seattle (or sunshine in Aruba), kappa statistic tries to take away the bias in the actual distribution, while the F1 measure may not.

January 30th, 2012

Comparing two confusion matrices

Comparing two confusion matrices is a standard approach for comparing the respective targeting systems, but it is by no means the only one. As we will discuss in the coming days, you can also compare two score-based targeting systems by comparing their lists. But for now, let us focus on comparing the targeting systems by comparing their respective confusion matrices.

The standard approach is to use a single value metric to reduce each matrix into one value, and then to compare the metric values.  In other words, to compare M1 and M2, we simply compare f(M1) and f(M2), where function f is the single value metric.

Here are some single value metrics that can be considered as candidates:

1. Kappa Statistic
2. F1 measure
3. Matthews Correlation Coefficient
4. Reward/Cost based
5. Sensitivity (Recall)
6. Specificity (Precision)

A related approach is to take the matrix difference of the two matrices and then apply a dot product (or scalar product) with a reward/cost matrix, but it is easy to see that this reduces to using a reward/cost based metric.  [A.C - B.C = (A-B).C etc.]
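A minimal Python sketch of this identity, with made-up confusion matrices A and B and an arbitrary reward/cost matrix C (all values are illustrative):

```python
# Sketch: taking the matrix difference and then a dot product with a
# reward/cost matrix C reduces to the reward/cost metric itself:
# A.C - B.C = (A - B).C
def dot(m, c):
    # Element-wise (Frobenius) inner product of two 2x2 matrices.
    return sum(m[i][j] * c[i][j] for i in range(2) for j in range(2))

A = [[40, 10], [5, 45]]   # confusion matrix of system 1 (made up)
B = [[30, 20], [10, 40]]  # confusion matrix of system 2 (made up)
C = [[1, -2], [-5, 3]]    # rewards for correct cells, costs for errors

D = [[A[i][j] - B[i][j] for j in range(2)] for i in range(2)]
assert dot(A, C) - dot(B, C) == dot(D, C)
```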

January 16th, 2012

Targeting Systems vs. Classification Systems

It is generally said that a targeting system is a degenerate classification system with only two labels.  However, this is misleading.  For example, consider that you are trying to identify a list of customers who are good prospects for an upselling opportunity.  So, you run the system and generate the “to call” list.  The system has then separated your universe of customers into “good candidate” and “not good candidate” sets.  By the traditional definition, the decision system is a targeting system.  However, consider the scenario in which you decide to separate the list into three parts based on how strong the prospects are: “very likely”, “somewhat likely” and “not likely”.  Has the same system suddenly stopped being a targeting system and become a broader classification system?  Of course not, and this example helps us refine the concept of targeting systems vis-à-vis classification systems.  A system can be a multi-level targeting system that assigns one of possibly many levels of action.  It is even possible to have a targeting system with infinite levels, by simply requiring the system to output a “percent likelihood” instead of yes/no.

On the flip side, a two-label classification system, which is trying to separate the given list of objects into two different labels, shouldn’t really be considered a targeting system.  For example, consider the case in which a classification system is trying to classify a given fossil as belonging to one of the many known dinosaur species.  As the age of the fossil and its other characteristics continue to get analyzed, the choice narrows down to either Struthiomimus or Ornithomimus.  Since there are now only two labels, does it suddenly become a targeting system?  We argue that it does not, since the goal is to assign one of the labels to the fossil.  The goal was not to take some action based on that decision.

Thus, the real difference between a classification system and a targeting system is the intent. If you intend to target some objects out of a given set to take one action (investigate, upsell, give a free upgrade, stop from boarding the plane), then that is a targeting system.  If you intend to assign one of the labels to the given object, then that is a classification system.

January 5th, 2012

Ethical/Policy issues concerning Targeting Systems

Targeting Systems can be used to target a wide variety of “things”. The things can be people, tax filings, medical claims, flights, customers, and many others. For example, targeting systems can be used to find which tax filings (or medical claims) should be reviewed (targeted) for errors (or fraud); it is just as easy to envision a targeting system that reviews customers arriving at a department store to provide individual assistance. Casinos have been doing this for years to find the “high rollers”. The travel industry does this to find travelers who may be given upgrades.

Frequently though, targeting systems target people. And in even more specific cases, the system doing the targeting belongs to the government. It is in those cases that ethical, policy and privacy issues most commonly emerge in the public domain. One obvious reason is that if the targeting system belongs to the government, then it may be forced to disclose more specifics of the system (hotels and airlines may have no obligation to disclose their systems for identifying high profile travelers). As an example, the US Food and Drug Administration has been publishing import alerts and import bulletins for many years (incidentally, FDA’s risk targeting system is called PREDICT).

An interesting example of such a scenario is the Automated Targeting System (ATS), a US DHS system that assigns a risk assessment to every person who crosses the US border. Although first implemented in the late 1990s, the system first came to public attention in November 2006, when a mention of it appeared in the Federal Register. Since then, the system has been the subject of many lawsuits, primarily from the ACLU and citizens concerned about their privacy. Bruce Schneier, author of books on privacy and computer security (including “Liars and Outliers: Enabling the Trust that Society Needs to Thrive”), wrote about ATS:

There is something un-American about a government program that uses secret criteria to collect dossiers on innocent people and shares that information with various agencies, all without any oversight. It’s the sort of thing you’d expect from the former Soviet Union or East Germany or China. And it doesn’t make us any safer from terrorism.

Outside of ATS, the broader questions are:

• Is it acceptable for the government to maintain “risk” files on its innocent citizens, especially when those files contain the output of a targeting system that is not transparent? (The storage aspect of those files is the main concern here.)
• More specifically, is it acceptable for the government to use an obscure and opaque risk targeting system? (The opaqueness of the system is the main concern here.)

It is also worthwhile to consider one simple risk targeting system that everyone is largely OK with today – the X-ray machine that we all pass through when entering secure facilities, airports, etc. People have largely come to accept that machine, and are also comfortable with the simplicity and the transparency (no pun intended) of that specific risk targeting system.

December 29th, 2011

Confusion Matrix – Another Single Value Metric – Kappa Statistic

Background: This is another in the line of posts on how to compare confusion matrices. The approach, as in past posts, is to use an aggregate objective function (or single value metric) that takes a confusion matrix and reduces it to one value.

In a previous post, we have discussed Matthews Correlation Coefficient, F1 measure, and reward/cost based single value metrics. Another single value metric (or aggregate objective function) that is worth discussing is the Kappa Statistic.

Kappa Statistic is interesting in the sense that it actually tries to compare the accuracy of the system to the accuracy of a random system.  To quote Richard Landis and Gary Koch from the 1977 paper The Measurement of Observer Agreement for Categorical Data, “..(total accuracy) is an observational probability of agreement and (random accuracy) is a hypothetical expected probability of agreement under an appropriate set of baseline constraints.”

Total accuracy is simply the sum of true positives and true negatives, divided by the total number of items, that is:

Total Accuracy = (TP + TN) / (TP + TN + FP + FN)

Random accuracy is defined as the sum of the products of reference likelihood and result likelihood for each class. That is,

Random Accuracy = P(actual T) * P(predicted T) + P(actual F) * P(predicted F)

In terms of false positives etc., random accuracy can be written as:

Random Accuracy = [(FN + TP)(FP + TP) + (TN + FP)(TN + FN)] / (TP + TN + FP + FN)^2
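As a sketch, the definitions above can be combined into a small Python function (written for the general n-class case; the function and variable names are mine):

```python
# A general n-class kappa, following the definitions above:
# matrix[i][j] = number of items with actual class i and predicted class j.
def kappa(matrix):
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    # Total accuracy: diagonal (agreement) counts over all items.
    total_accuracy = sum(matrix[i][i] for i in range(n)) / total
    # Random accuracy: sum over classes of
    # (reference likelihood) x (result likelihood).
    random_accuracy = sum(
        sum(matrix[c]) * sum(matrix[r][c] for r in range(n))
        for c in range(n)
    ) / total ** 2
    return (total_accuracy - random_accuracy) / (1 - random_accuracy)
```

For example, kappa([[1, 99], [1, 899]]) evaluates to about 0.0157.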

I have taken the previous test case confusion matrices and added the Kappa to that as well. Here is a snapshot.

Two things about kappa statistic that are of further interest:

Firstly, it is a general statistic that can be used for classification systems, not just for targeting systems. Secondly, the kappa statistic is a normalized statistic, just like MCC: its value never exceeds one, so the same statistic can be used even as the number of observations grows.

Here is the link to the PDF if that is of interest.
