If your submission is valid, we compute a score that consists of two components:

- \(1 - F_1\)
- Regression Error (MSE)

The \(F_1\)-score is a popular statistic that takes both precision and recall into consideration. It is computed by

$$ F_1 = 2 \cdot \frac{precision \cdot recall}{precision + recall} $$

but we will formalize this later. In the leaderboard, \(1 - F_1\) is used to rank your submissions, while the MSE serves as a tie-breaker. True to the motto of Kelvins: "Reach the absolute zero!", a score of (0, 0) would mean that you predicted all the objects within the given pixel-tolerances without introducing false positives. Generally speaking, your score will be worse if you

- Miss objects (false negatives)
- Predict something that has not been annotated as an object (false positives)
- Are imprecise in your predictions (increasing MSE)

The total score of your submission is the average of the per-frame scores over the test-set. The exact details can be found in the following document: spotGEO_metric.pdf. Furthermore, the starter-kit (hosted on Zenodo) provides the code that is run on the Kelvins server to compute the score.

To understand how the score is defined, we explain its computation for a fixed frame \(f\) in the following.

Assume that for a frame \(f\) we have a set of \(N\) ground-truth coordinates \(Y^f = (y_1^f, \ldots, y_N^f)\) and a submitted prediction with \(M_f\) coordinates \(X^f= (x_1^f, \ldots, x_{M_f}^f)\). In a first step, we compute a *one-to-one matching* between \(Y^f\) and \(X^f\) to determine which predictions are likely meant to indicate which object. There are multiple ways to define such a matching, but for the scoring we decided to compute the matching that minimizes the sum of all truncated Euclidean distances of matched points (while maximizing the number of matched pairs). Formally, this constitutes the solution to a **minimum weighted unbalanced assignment problem**, which can be solved efficiently by the Hungarian algorithm, linear programming, or similar.
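As an illustration, the matching step can be sketched in Python. The following brute-force version (function name and signature are my own, not from the official metric code) enumerates all possible assignments and is therefore only feasible for a handful of points per frame; the actual scoring uses an efficient solver such as the Hungarian algorithm:

```python
import itertools
import math

def match_predictions(Y, X, tau):
    """Minimum-cost one-to-one matching between ground-truth points Y
    and predicted points X, using Euclidean distances truncated at tau.
    Returns a list of (object index, prediction index) pairs."""
    def cost(a, b):
        return min(math.dist(a, b), tau)  # truncated Euclidean distance

    if len(Y) <= len(X):
        # assign each object to a distinct prediction
        best = min(itertools.permutations(range(len(X)), len(Y)),
                   key=lambda p: sum(cost(Y[i], X[p[i]]) for i in range(len(Y))))
        return [(i, best[i]) for i in range(len(Y))]
    # more objects than predictions: assign each prediction to an object
    best = min(itertools.permutations(range(len(Y)), len(X)),
               key=lambda p: sum(cost(Y[p[j]], X[j]) for j in range(len(X))))
    return [(best[j], j) for j in range(len(X))]
```

For realistically sized frames, `scipy.optimize.linear_sum_assignment` solves the same rectangular assignment problem efficiently.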

Computing the matching like this will (in most cases) result in each object being matched with its nearest neighbor prediction. In cases where one prediction would be the nearest neighbor for two objects (with no other candidates), the prediction is associated with the closer of the two objects. In general, this matching minimizes the distances between objects and predictions, which is important because there exists a border-radius \(\tau\) around each true object, within which your prediction has to fall to (potentially) be counted as a true positive. A second, smaller tolerance-radius \(\varepsilon < \tau\) determines the contribution of your prediction to the regression error. If \(y_i\) is matched with \(x_j\) and \(d = d(x_j, y_i) = || x_j - y_i ||_2\) is the Euclidean distance between the two, then the prediction counts as a **true positive** if \(d \leq \tau\).

The additive contribution of this matched pair to the regression error depends on the distance \(d\):

- if \(d < \varepsilon\), the contribution to the regression error is 0
- if \(\varepsilon \leq d \leq \tau\), the contribution to the regression error is \(d^2\)

For cases in which \(d > \tau\), the prediction is not counted as a true positive and contributes a fixed \(\tau^2\) to the regression error. Since this matching is outside the tolerance, it introduces one **false negative** and one **false positive**. Together with unmatched objects (which each count as a **false negative**) and unmatched predictions (which each count as a **false positive**), we end up with the following sets

- \(TP^f = \) true positives of frame \(f\)
- \(FP^f = \) false positives of frame \(f\)
- \(FN^f = \) false negatives of frame \(f\)

and define the sum of squared errors \(SSE^f\) of frame \(f\) by:

$$ SSE^f = \sum\limits_{(i,j) \in TP^f} \pi(x_i, y_j) + \sum\limits_{j \in FN^f} \tau^2 + \sum\limits_{i \in FP^f} \tau^2 $$

where \(\pi(x_i, y_j) = 0\) if \(||x_i - y_j||_2 < \varepsilon\) and \(\pi(x_i, y_j) = ||x_i - y_j||_2^2\) otherwise.

In the special case that all \(TP^f, FP^f\) and \(FN^f\) are \(\emptyset\), we define \(SSE^f\) to be 0.
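Assuming the matched pairs and their pairwise distances are already available (e.g. from the assignment step), the per-frame bookkeeping defined above can be sketched as follows (names and signature are illustrative):

```python
def frame_score(pairs, dist, n_truth, n_pred, eps, tau):
    """Classify matched (object, prediction) pairs for one frame and
    compute TP, FP, FN and SSE. `dist(i, j)` returns the Euclidean
    distance between object i and prediction j."""
    tp, sse = 0, 0.0
    for i, j in pairs:
        d = dist(i, j)
        if d <= tau:               # true positive
            tp += 1
            if d >= eps:           # inside tau but outside eps: adds d^2
                sse += d * d
        else:                      # match outside tolerance:
            sse += 2 * tau * tau   # one FN and one FP, tau^2 each
    failed = len(pairs) - tp       # matches beyond tau
    fn = (n_truth - len(pairs)) + failed  # unmatched objects + failed matches
    fp = (n_pred - len(pairs)) + failed   # unmatched predictions + failed matches
    # each unmatched object or prediction adds a further tau^2
    sse += ((n_truth - len(pairs)) + (n_pred - len(pairs))) * tau * tau
    return tp, fp, fn, sse
```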

To the right, you can see an example prediction of \(\lbrace x_1, x_2, x_3, x_4\rbrace\) for true objects \(\lbrace y_1, y_2, y_3 \rbrace\), omitting the frame superscript \(f\) for clarity. Solving the assignment problem, the matching that minimizes the sum over all distances pairs \((y_1, x_1), (y_2, x_2)\) and \((y_3, x_3)\), with \(x_4\) left unmatched.

- \(x_1\) is a true positive, because \(d(x_1, y_1) \leq \tau\).
- \(x_1\) contributes 0 to the regression error, because \(d(x_1, y_1) < \varepsilon\).
- \(x_2\) gets matched to \(y_2\) because \( d(x_2, y_2) \leq d(x_2, y_3)\).
- \(x_2\) is a true positive, because \(d(x_2, y_2) \leq \tau\).
- \(x_2\) contributes \(d(x_2, y_2)^2\) to the regression error, because \(d(x_2, y_2) > \varepsilon\).
- \(x_3\) is a false positive because \(d(x_3, y_3) > \tau\). Because of that, \(x_3\) contributes \(\tau^2\) to the regression error.
- \(x_4\) is a false positive because it cannot be matched, contributing an additional \(\tau^2\) to the regression error.
- \(y_3\), despite having a prediction within its border-radius, is a false negative, as \(x_2\) was already matched to \(y_2\). This contributes another \(\tau^2\) to the regression error.

To summarize, we have:

\(TP = \lbrace (1,1), (2,2) \rbrace\), \(FP = \lbrace 3, 4 \rbrace\), \(FN = \lbrace 3 \rbrace\)

and a regression error of

\(SSE = 0 + d(x_2, y_2)^2 + 3 \tau^2\).

For a given sequence of 5 frames, we have

\(TP = \sum\limits_{f = 1}^{5} |TP^f|, FP = \sum\limits_{f = 1}^{5} |FP^f|, FN = \sum\limits_{f = 1}^{5} |FN^f|\) and \(SSE = \sum\limits_{f = 1}^{5} SSE^f\).

We further define

$$ MSE = \frac{SSE}{TP + FN + FP}. $$

(again, \(MSE = 0\) if \(TP + FN + FP = 0\), i.e. when there is nothing to average.)
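As a small sketch, this guarded average (with the zero convention above) could read:

```python
def mse(sse, tp, fn, fp):
    """MSE over all events; 0 by convention when there are no events."""
    total = tp + fn + fp
    return sse / total if total > 0 else 0.0
```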

For a whole submission, consisting of \(K\) sequences, where we denote with \(TP_k\) the true positives of the \(k\)-th sequence (and so on with \(FP_k, FN_k, SSE_k\)), we compute precision \(P\), recall \(R\) and \(MSE\) as follows:

$$ P = \frac{\sum\limits_{k=1}^{K}TP_k}{\sum\limits_{k=1}^{K}TP_k + FP_k} \qquad R = \frac{\sum\limits_{k=1}^{K}TP_k}{\sum\limits_{k=1}^{K}TP_k + FN_k} \qquad MSE = \frac{\sum\limits_{k=1}^{K}SSE_k}{\sum\limits_{k=1}^{K}TP_k + FN_k + FP_k}. $$

With this we can compute \(F_1 = \frac{2 PR}{P + R}\). Our ranking is determined by \(1 - F_1\) with the **smallest number** being the winner. \(MSE\) is used as a tiebreaker.
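Putting the aggregation formulas together, the final leaderboard score could be computed along these lines (a sketch; `leaderboard_score` is an illustrative name, and the per-sequence tuples are assumed to be already computed):

```python
def leaderboard_score(sequences):
    """Aggregate per-sequence (TP, FP, FN, SSE) tuples into the
    leaderboard score (1 - F1, MSE)."""
    tp = sum(s[0] for s in sequences)
    fp = sum(s[1] for s in sequences)
    fn = sum(s[2] for s in sequences)
    sse = sum(s[3] for s in sequences)
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0
    mse = sse / (tp + fn + fp) if tp + fn + fp > 0 else 0.0
    return 1.0 - f1, mse
```

A perfect submission yields `(0.0, 0.0)`, matching the "Reach the absolute zero!" motto.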