## Comparing Kappa Statistic to the F1 measure

We looked at Kappa Statistic previously, and I have been evaluating some aspects of it again.

To remind ourselves, the kappa statistic is a measure of consistency among different raters, taking into account the agreement occurring by chance. The standard formula for the kappa statistic is:

kappa = (p_o - p_e) / (1 - p_e)

where p_o is the observed agreement (total accuracy) and p_e is the agreement expected by chance (random accuracy).

Firstly, an observation that I omitted to make: the value of the kappa statistic can indeed be negative. The total accuracy can be less than the random accuracy, and as a CMAJ letter by Juurlink and Detsky points out, this may indicate genuine disagreement, or it may reflect a problem in the application of a diagnostic test.

Secondly, one thing to love about kappa is the following. Consider the case where one actual class is much more prevalent than the other. In such a case, a classification system that simply outputs the more prevalent class may have a high F1 measure (high precision and high recall), but will have a very low kappa value. For example, suppose we are asked whether it will rain in Seattle, and consider the following confusion matrix:

|          | Predicted F | Predicted T |
| -------- | ----------- | ----------- |
| Actual F | 1           | 99          |
| Actual T | 1           | 899         |

This is a null-hypothesis model, in the sense that it almost always predicts the class to be "T". For this confusion matrix, precision is 0.9008 and recall is 0.9989; the F1 measure is 0.9473, which can give the impression that this is a useful model. The kappa value, however, is very low at 0.0157, a clear enough warning about the validity of this model.
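As a sanity check, these numbers can be reproduced directly from the confusion matrix. The sketch below is plain Python; the cell names TN/FP/FN/TP are my own labeling of the matrix above, with "T" (rain) as the positive class:

```python
# Confusion matrix for the Seattle rain example:
# rows = actual, columns = predicted, classes ordered (F, T)
TN, FP = 1, 99    # actual F: predicted F, predicted T
FN, TP = 1, 899   # actual T: predicted F, predicted T
n = TN + FP + FN + TP

precision = TP / (TP + FP)                           # 0.9008
recall    = TP / (TP + FN)                           # 0.9989
f1 = 2 * precision * recall / (precision + recall)   # 0.9473

p_o = (TP + TN) / n   # observed agreement (total accuracy)
# chance agreement: product of marginals, summed over both classes
p_e = ((TN + FN) * (TN + FP) + (FP + TP) * (FN + TP)) / n ** 2
kappa = (p_o - p_e) / (1 - p_e)                      # 0.0157

print(round(precision, 4), round(recall, 4), round(f1, 4), round(kappa, 4))
```

The chance-agreement term p_e is dominated by the prevalent "T" class (0.998 × 0.9), which is exactly why the near-constant predictor gets almost no credit from kappa.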

In other words, while it may be easier to predict rain in Seattle (or sunshine in Aruba), kappa statistic tries to take away the bias in the actual distribution, while the F1 measure may not.

## Job scheduling mechanisms in Clouds

As more and more organizations have started to use the cloud for their computing needs (I have been using Amazon Web Services, and EC2 in particular, for more than two years now), a relatively new set of challenges has arisen. Cloud providers sell computing resources, while organizations care about their analytical and computing jobs being completed, irrespective of the computing resources those jobs require. Organizations would like to place a value on the job, instead of on the resource, and there is currently a missing link between the two.

As a regular reviewer for Computing Reviews, I have just finished a formal review of "Near-optimal Scheduling Mechanisms for Deadline-Sensitive Jobs in Large Computing Clusters", and that work specifically tries to address these new kinds of challenges that we are all observing as the movement to cloud computing gathers further steam.

## Distributed tuning of machine learning algorithms using MapReduce clusters

My review of Ganjisaffar et al’s 2011 LDMTA paper is available at:

http://www.computingreviews.com/browse/browse_reviewers.cfm?reviewer_id=123480

## Comparing two confusion matrices

Comparing two confusion matrices is a standard approach for comparing the respective targeting systems, but it is by no means the only one. As we will discuss in the coming days, you can also compare two score-based targeting systems by comparing their lists. But for now, let us focus on comparing the targeting systems by comparing their respective confusion matrices.

The standard approach is to use a single value metric to reduce each matrix into one value, and then to compare the metric values. In other words, to compare M1 and M2, we simply compare f(M1) and f(M2), where function f is the single value metric.

Here are some single value metrics that can be considered as candidates:

- Kappa Statistic
- F1 measure
- Matthews Correlation Coefficient
- Reward/Cost based
- Sensitivity (Recall)
- Specificity
- Precision
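For a binary confusion matrix, each of these candidates reduces to a small function of the four cells. The sketch below (plain Python; the 2x2 layout `[[TN, FP], [FN, TP]]` and the second matrix `m2` are my own assumptions for illustration) implements several of them, so that comparing M1 and M2 becomes comparing f(M1) and f(M2):

```python
import math

def metrics(m):
    """Single-value metrics for a 2x2 confusion matrix [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = m
    n = tn + fp + fn + tp
    sensitivity = tp / (tp + fn)          # recall
    specificity = tn / (tn + fp)
    precision   = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    p_o = (tp + tn) / n
    p_e = ((tn + fn) * (tn + fp) + (fp + tp) * (fn + tp)) / n ** 2
    kappa = (p_o - p_e) / (1 - p_e)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {"sensitivity": sensitivity, "specificity": specificity,
            "precision": precision, "f1": f1, "kappa": kappa, "mcc": mcc}

m1 = [[1, 99], [1, 899]]     # the Seattle rain matrix from earlier
m2 = [[80, 20], [50, 850]]   # a hypothetical second targeting system
print(metrics(m1)["kappa"], metrics(m2)["kappa"])
```

Note that for the Seattle rain matrix, specificity (0.01) and MCC (about 0.06) tell the same cautionary story as kappa, while F1 does not.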

A related approach is to take the matrix difference of the two matrices and then apply a dot (scalar) product with a cost matrix, but it is easy to see that this transforms into using a reward/cost based metric. [A.C - B.C = (A - B).C etc.]
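The bracketed identity is easy to verify numerically. The sketch below is plain Python; the matrices A and B and the reward/cost matrix C are hypothetical values chosen for illustration:

```python
def dot(m, c):
    """Scalar (elementwise) product of two 2x2 matrices."""
    return sum(m[i][j] * c[i][j] for i in range(2) for j in range(2))

def diff(a, b):
    """Elementwise difference of two 2x2 matrices."""
    return [[a[i][j] - b[i][j] for j in range(2)] for i in range(2)]

# Hypothetical confusion matrices and a reward/cost matrix
A = [[1, 99], [1, 899]]
B = [[80, 20], [50, 850]]
C = [[1, -5], [-1, 2]]   # e.g. reward the correct cells, penalize the errors

# Difference of dot products equals dot product of the difference
assert dot(A, C) - dot(B, C) == dot(diff(A, B), C)
```

Since the dot product is linear in its first argument, comparing the matrix difference against a cost matrix is the same as comparing the two reward/cost scores directly.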