July 1, 2022, midnight UTC – Oct. 31, 2022, midnight UTC
Since "the OPS-SAT case" is, at its core, a classification task we will be employing a simple transformation of Cohen’s kappa to score all submissions. Cohen’s kappa is a metric often used to assess the agreement between two raters, but it can also be used to assess the performance of a classification model and, as opposed to accuracy, it accounts for class imbalance. Let us indicate the confusion matrix of a given model, as computed over \(N\) samples, with \(M_{ij}\), then the Cohen's kappa metric is mathematically defined as:
\(\kappa = \frac{p_0-p_e}{1-p_e}\)
where:
\(p_0 = \frac{\sum_i M_{ii}}{N}\)
and,
\(p_e = \frac{1}{N^2} \sum_k (\sum_i M_{ki} \cdot \sum_i M_{ik})\)
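As an illustration only (not the official scoring code), the quantities \(p_0\), \(p_e\) and \(\kappa\) above can be computed from a confusion matrix with a few lines of NumPy; the function name and the toy matrix below are placeholders:

```python
import numpy as np

def cohen_kappa_from_confusion(M):
    """Compute Cohen's kappa from a confusion matrix M (rows: ground truth, columns: predictions)."""
    M = np.asarray(M, dtype=float)
    N = M.sum()                                           # total number of samples
    p0 = np.trace(M) / N                                  # observed agreement
    pe = (M.sum(axis=1) * M.sum(axis=0)).sum() / N ** 2   # chance agreement
    return (p0 - pe) / (1.0 - pe)

# hypothetical 3-class confusion matrix, just for demonstration
M = [[50,  2,  3],
     [ 5, 40,  5],
     [ 2,  3, 45]]
print(cohen_kappa_from_confusion(M))
```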
A good explanation of its derivation and meaning can be found, for example, in the blog post Multi-Class Metrics Made Simple, Part III: the Kappa Score (aka Cohen’s Kappa Coefficient). The metric essentially measures the agreement between the ground-truth labels and the model predictions, factoring out the probability that the agreement happens purely by chance.
In "the OPS-SAT case" the final score is computed as:
\(\mathcal L = 1 - \kappa\)
Since \(\kappa \in [-1,1]\), it follows that \(\mathcal L \in [0,2]\), with lower values being better ... "reach the absolute zero", remember?
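For reference, a minimal sketch of how the same score could be reproduced from ground-truth and predicted label vectors using scikit-learn (the label arrays below are made up for illustration):

```python
from sklearn.metrics import cohen_kappa_score

# hypothetical ground-truth and predicted labels
y_true = [0, 1, 2, 2, 1, 0, 0, 2]
y_pred = [0, 1, 2, 1, 1, 0, 2, 2]

kappa = cohen_kappa_score(y_true, y_pred)
score = 1.0 - kappa   # the competition loss: lower is better, 0 means perfect agreement
print(kappa, score)
```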