We looked at the kappa statistic previously, and I have been re-evaluating some aspects of it.
To remind ourselves, the kappa statistic is a measure of consistency among different raters that takes into account the agreement occurring by chance. The standard formula for the kappa statistic is given as:
kappa = (total accuracy - random accuracy) / (1 - random accuracy)
Firstly, an observation that I omitted to make: the value of the kappa statistic can indeed be negative. The total accuracy can be lower than the random accuracy, and as a CMAJ letter by Juurlink and Detsky points out, this may indicate genuine disagreement, or it may reflect a problem in the application of a diagnostic test.
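To see a negative value concretely, here is a minimal Python sketch. It assumes the usual two-rater (Cohen's) form of kappa, with random accuracy estimated from the row and column totals of the confusion matrix. The function name `cohen_kappa` and the counts are made up for illustration and do not come from any real data set.

```python
def cohen_kappa(matrix):
    """Kappa for a square confusion matrix (rows: actual, columns: predicted)."""
    k = len(matrix)
    total = sum(sum(row) for row in matrix)

    # Total accuracy: the fraction of cases that fall on the diagonal.
    total_accuracy = sum(matrix[i][i] for i in range(k)) / total

    # Random accuracy: for each class, the share of actual cases times the
    # share of predicted cases, summed over all classes.
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[i][j] for i in range(k)) for j in range(k)]
    random_accuracy = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2

    # Undefined (division by zero) when random accuracy is exactly 1.
    return (total_accuracy - random_accuracy) / (1 - random_accuracy)


# Hypothetical counts: the two raters disagree far more often than they agree.
disagreeing = [[10, 40],
               [40, 10]]

# Total accuracy is 0.2 and random accuracy is 0.5, so kappa is negative.
print(round(cohen_kappa(disagreeing), 2))
```

Here the total accuracy (0.2) falls below the random accuracy (0.5), so kappa comes out at -0.6. Perfect agreement on the same totals would give 1.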
Secondly, one thing to love about kappa is the following. Consider the case where one actual class is much more prevalent than the other. In such a case, a classification system that simply outputs the more prevalent class may have a high F1 measure (high precision and high recall), but will have a very low value of kappa. For example, suppose we are asked whether it will rain in Seattle, and consider the following confusion matrix: