How to plot a ROC curve using ROCR package in r, *with only a classification contingency table*

守給你的承諾、 提交于 2019-12-05 06:27:18

You cannot generate the full ROC curve with a single contingency table because a contingency table provides only a single sensitivity/specificity pair (for whatever predictive cutoff was used to generate the contingency table).

If you had many contingency tables that were generated with different cutoffs, you would be able to approximate the ROC curve (basically it will be a linear interpolation between the sensitivity/specificity values in your contingency tables). As an example, let's consider predicting whether a flower is versicolor in the iris dataset using logistic regression:

iris$isv <- as.numeric(iris$Species == "versicolor")
mod <- glm(isv~Sepal.Length+Sepal.Width, data=iris, family="binomial")

We could use the standard ROCR code to compute the ROC curve for this model:

library(ROCR)
pred1 <- prediction(predict(mod), iris$isv)
perf1 <- performance(pred1,"tpr","fpr")
plot(perf1)

Now let's assume that instead of mod all we have is contingency tables with a number of cutoffs values for predictions:

tables <- lapply(seq(0, 1, .1), function(x) table(iris$isv, factor(predict(mod, type="response") >= x, levels=c(F, T))))

# Predict TRUE if predicted probability at least 0
tables[[1]]
#     FALSE TRUE
#   0     0  100
#   1     0   50

# Predict TRUE if predicted probability at least 0.5
tables[[6]]
#     FALSE TRUE
#   0    86   14
#   1    29   21

# Predict TRUE if predicted probability at least 1
tables[[11]]
#     FALSE TRUE
#   0   100    0
#   1    50    0

From one table to the next some predictions changed from TRUE to FALSE due to the increased cutoff, and by comparing column 1 of the successive table we can determine which of these represent true negative and false negative predictions. Iterating through our ordered list of contingency tables we can create fake predicted value/outcome pairs that we can pass to ROCR, ensuring that we match the sensitivity/specificity for each contingency table.

fake.info <- do.call(rbind, lapply(1:(length(tables)-1), function(idx) {
  true.neg <- tables[[idx+1]][1,1] - tables[[idx]][1,1]
  false.neg <- tables[[idx+1]][2,1] - tables[[idx]][2,1]
  if (true.neg <= 0 & false.neg <= 0) {
    return(NULL)
  } else {
    return(data.frame(fake.pred=idx,
                      outcome=rep(c(0, 1), times=c(true.neg, false.neg))))
  }
}))

Now we can pass the faked predictions to ROCR as usual:

pred2 <- prediction(fake.info$fake.pred, fake.info$outcome)
perf2 <- performance(pred2,"tpr","fpr")
plot(perf2)

Basically what we have done is a linear interpolation of the points that we do have on the ROC curve. If you had contingency tables for many cutoffs you could more closely approximate the true ROC curve. If you don't have a wide range of cutoffs you can't hope to accurately reproduce the full ROC curve.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!