Since there are several ways to solve this and none of them is truly generic (see https://stats.stackexchange.com/questions/202336/true-positive-false-negative-true-negative-false-positive-definitions-for-mul?noredirect=1&lq=1 and
https://stats.stackexchange.com/questions/51296/how-do-you-calculate-precision-and-recall-for-multiclass-classification-using-co#51301), here is the convention that the paper I was unsure about seems to use:
any confusion between two foreground pages is counted as a false positive.

The counts can then be computed with NumPy, treating -1 as the background label; with y_true and y_prediction as np.array:
import numpy as np

# -1 marks the background class; confusing one foreground class with
# another counts as a false positive, per the convention above
FP = np.logical_and(y_true != y_prediction, y_prediction != -1).sum()  # 9
FN = np.logical_and(y_true != y_prediction, y_prediction == -1).sum()  # 4  foreground predicted as background
TP = np.logical_and(y_true == y_prediction, y_true != -1).sum()        # 3  correct foreground
TN = np.logical_and(y_true == y_prediction, y_true == -1).sum()        # 1  correct background
TPR = TP / (TP + FN)  # true positive rate (recall): 0.42857142857142855
FPR = FP / (FP + TN)  # false positive rate: 0.9
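
The arrays that produce the counts above are not shown here, so as a sanity check, here is a minimal self-contained sketch with made-up data (classes 0-2 as foreground, -1 as background; the numbers are purely illustrative):

import numpy as np

# hypothetical example data, just for illustration
y_true       = np.array([0, 1, 2, -1,  0, 1, 2, -1])
y_prediction = np.array([0, 2, 2, -1, -1, 1, 0,  1])

FP = np.logical_and(y_true != y_prediction, y_prediction != -1).sum()  # 3 (the 1->2, 2->0, -1->1 errors)
FN = np.logical_and(y_true != y_prediction, y_prediction == -1).sum()  # 1 (the 0->-1 miss)
TP = np.logical_and(y_true == y_prediction, y_true != -1).sum()        # 3
TN = np.logical_and(y_true == y_prediction, y_true == -1).sum()        # 1

print(TP / (TP + FN))  # TPR = 0.75
print(FP / (FP + TN))  # FPR = 0.75

Note that both the foreground-to-foreground confusions (1->2, 2->0) and the background sample predicted as foreground (-1->1) land in FP, which is exactly the convention described above.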