Question
I have the following list of lists:
[[1, 1, 1, 1, 3, 0, 0, 1],
[1, 1, 1, 1, 3, 0, 0, 1],
[1, 1, 1, 1, 2, 0, 0, 1],
[1, 1, 0, 2, 3, 1, 0, 1]]
where I want to calculate an inter-rater reliability score; there are multiple raters (rows). I cannot use Fleiss' kappa, since the rows do not sum to the same number. What is a good approach in this case?
Answer 1:
The basic problem here is that you have not organized the data the way Fleiss' kappa expects. See here for the proper organization. You have four categories (ratings 0-3) and eight subjects, so your table must have eight rows and four columns, regardless of the number of raters. For instance, the top row is the tally of ratings given to the first item:
[0, 4, 0, 0] ... since everyone rated it a `1`.
Your -inf value comes from dividing by 0 in the P[j] score for the penultimate column.
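To make that reorganization concrete, here is a minimal sketch (my addition, not part of the original answer) that builds the 8x4 tally table and computes Fleiss' kappa; it assumes statsmodels is installed:

import numpy as np
from statsmodels.stats import inter_rater as irr

# Rows are raters, columns are the eight items being rated.
ratings = np.array([[1, 1, 1, 1, 3, 0, 0, 1],
                    [1, 1, 1, 1, 3, 0, 0, 1],
                    [1, 1, 1, 1, 2, 0, 0, 1],
                    [1, 1, 0, 2, 3, 1, 0, 1]])

# aggregate_raters expects (subjects, raters), hence the transpose.
# It returns an 8x4 table of category counts per item.
table, categories = irr.aggregate_raters(ratings.T)
print(table[0])   # [0 4 0 0] -- everyone rated the first item a `1`
print(irr.fleiss_kappa(table, method='fleiss'))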
My earlier answer, which normalized the scores, was based on a misreading of Fleiss; I had a different kind of reliability in mind. There are many ways to compute such a metric: one is consistency of relative rating points (which you can get with normalization); another is to convert each rater's row into a graph of relative rankings and compute a similarity between those graphs.
Note that Fleiss is not perfectly applicable to a rating situation with a relative metric: it assumes a classification task, not a ranking. Fleiss is not sensitive to how far apart the ratings are; it knows only that the ratings differed: a (0, 1) pairing is just as damaging as a (0, 3) pairing, as the sketch below illustrates.
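Here is a small check of that claim (again my addition, using statsmodels): two tally tables whose only difference is whether the disagreeing rating is a 1 or a 3 produce exactly the same kappa.

from statsmodels.stats import inter_rater as irr

# Four subjects, two raters each; tables are (subjects, categories 0-3).
near = [[1, 1, 0, 0]] * 4   # every subject got one 0 and one 1
far  = [[1, 0, 0, 1]] * 4   # every subject got one 0 and one 3

print(irr.fleiss_kappa(near, method='fleiss'))   # -1.0
print(irr.fleiss_kappa(far, method='fleiss'))    # -1.0, identical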
Answer 2:
The answer to this problem was to use Krippendorff's alpha:
Wikipedia Description
Python Library
import krippendorff

# Rows are raters, columns are the units being rated; missing
# ratings can be marked with np.nan.
arr = [[1, 1, 1, 1, 3, 0, 0, 1],
       [1, 1, 1, 1, 3, 0, 0, 1],
       [1, 1, 1, 1, 2, 0, 0, 1],
       [1, 1, 0, 2, 3, 1, 0, 1]]

res = krippendorff.alpha(arr)
print(res)
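One parameter worth knowing about: the library's alpha function takes a level_of_measurement argument, and since the ratings 0-3 are ordered, 'ordinal' may fit the data better than the default ('interval', per the package's documentation). A sketch:

import krippendorff

arr = [[1, 1, 1, 1, 3, 0, 0, 1],
       [1, 1, 1, 1, 3, 0, 0, 1],
       [1, 1, 1, 1, 2, 0, 0, 1],
       [1, 1, 0, 2, 3, 1, 0, 1]]

# 'nominal' treats all disagreements alike; 'ordinal' and 'interval'
# penalize a (0, 3) disagreement more heavily than a (0, 1).
print(krippendorff.alpha(reliability_data=arr,
                         level_of_measurement='ordinal'))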
Source: https://stackoverflow.com/questions/56481245/inter-rater-reliability-calculation-for-multi-raters-data