In R ggplot2, include stat_ecdf() endpoints (0,0) and (1,1)

Submitted by 断了今生、忘了曾经 on 2019-12-04 12:17:49

Question


I'm trying to use stat_ecdf() to plot cumulative successes as a function of a rank score created by a predictive model.

#libraries
require(ggplot2)
require(scales)

# fake data for reproducibility
set.seed(123)
n <- 200
df <- data.frame(model_score= rexp(n=n,rate=1:n),
                 obs_set= sample(c("training","validation"),n,replace=TRUE))
df$model_rank <- rank(df$model_score)/n
df$target_outcome <- rbinom(n,1,1-df$model_rank)

# Plot Gain Chart using stat_ecdf()
ggplot(subset(df,target_outcome==1),aes(x = model_rank)) + 
  stat_ecdf(aes(colour = obs_set), size=1) + 
  scale_x_continuous(limits=c(0,1), labels=percent,breaks=seq(0,1,.1)) +
  xlab("Model Percentile") + ylab("Percent of Target Outcome") +
  scale_y_continuous(limits=c(0,1), labels=percent) +
  geom_segment(aes(x=0,y=0,xend=1,yend=1), 
               colour = "gray", linetype="longdash", size=1) +
  ggtitle("Gain Chart")

All I want to do is force the ECDF to start at (0,0) and end at (1,1) so that there are no gaps at the beginning or end of the curve. If possible, I'd like to do it within the syntax of ggplot2, but I'd settle for a clever workaround.

@Henrik this is NOT a duplicate of this question, because I have already defined my limits with scale_x_continuous() and scale_y_continuous(), and adding expand_limits() doesn't do anything. It is not the origin of the plot but the endpoints of the stat_ecdf() curve that need to be fixed.


Answer 1:


Unfortunately, the definition of stat_ecdf gives no wiggle room here; it determines the endpoints internally.
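
To see why the endpoints matter, note that an ECDF only steps at observed values: its first jump is at the smallest score and it reaches 1 at the largest. Anything drawn outside that range has to come from points the stat adds on its own. A minimal base-R sketch with made-up scores (the vector x is hypothetical, not the question's data):

```r
# Hypothetical scores strictly inside (0, 1), for illustration only.
x <- c(0.2, 0.5, 0.5, 0.9)

# Evaluate the empirical CDF at each distinct observed value.
xs <- sort(unique(x))
ys <- ecdf(x)(xs)

# The curve first rises at min(x) and reaches 1 at max(x); the data alone
# supply no point at (0, 0) or (1, 1).
steps <- data.frame(x = xs, y = ys)
print(steps)
```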

There is a somewhat advanced solution. With the latest development version of ggplot2 (devtools::install_github("hadley/ggplot2")), extensibility has improved to the point where this behavior can be overridden, though not without some boilerplate.

stat_ecdf2 <- function(mapping = NULL, data = NULL, geom = "step",
                      position = "identity", n = NULL, show.legend = NA,
                      inherit.aes = TRUE, minval=NULL, maxval=NULL,...) {
  layer(
    data = data,
    mapping = mapping,
    stat = StatEcdf2,
    geom = geom,
    position = position,
    show.legend = show.legend,
    inherit.aes = inherit.aes,
    stat_params = list(n = n, minval=minval,maxval=maxval),
    params = list(...)
  )
}


StatEcdf2 <- ggproto("StatEcdf2", StatEcdf,
  calculate = function(data, scales, n = NULL, minval=NULL, maxval=NULL, ...) {
    df <- StatEcdf$calculate(data, scales, n, ...)
    if (!is.null(minval)) { df$x[1] <- minval }
    if (!is.null(maxval)) { df$x[length(df$x)] <- maxval }
    df
  }
)

Now stat_ecdf2 behaves the same as stat_ecdf, but accepts optional minval and maxval parameters that set the x positions of the first and last points. So this will do the trick:

ggplot(subset(df,target_outcome==1),aes(x = model_rank)) +
  stat_ecdf2(aes(colour = obs_set), size=1, minval=0, maxval=1) +
  scale_x_continuous(limits=c(0,1), labels=percent,breaks=seq(0,1,.1)) +
  xlab("Model Percentile") + ylab("Percent of Target Outcome") +
  scale_y_continuous(limits=c(0,1), labels=percent) +
  geom_segment(aes(x=0,y=0,xend=1,yend=1),
               colour = "gray", linetype="longdash", size=1) +
  ggtitle("Gain Chart")

The big caveat here is that I don't know if the current extensibility model will be supported in the future; it has changed several times in the past, and the change to use "ggproto" is recent -- like July 15th 2015 recent.
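
That caveat is worth taking seriously. As far as I know, released versions of ggplot2 name the per-group method compute_group() rather than calculate(), and layer() takes a single params list rather than a separate stat_params. The following is a sketch of the same override under those assumptions, not a tested drop-in. It also assumes that StatEcdf pads its result with a first row at -Inf and a last row at Inf (the default pad = TRUE behaviour, to my understanding); the toy data frame is hypothetical.

```r
library(ggplot2)

# Assumption: released ggplot2, where the per-group method is compute_group()
# and StatEcdf pads its result with a first row at -Inf and a last row at Inf.
StatEcdf2 <- ggproto("StatEcdf2", StatEcdf,
  compute_group = function(data, scales, n = NULL,
                           minval = NULL, maxval = NULL, ...) {
    df <- StatEcdf$compute_group(data, scales, n = n)
    # Move the padding points to the requested endpoints.
    if (!is.null(minval)) df$x[1] <- minval
    if (!is.null(maxval)) df$x[nrow(df)] <- maxval
    df
  }
)

stat_ecdf2 <- function(mapping = NULL, data = NULL, geom = "step",
                       position = "identity", n = NULL, show.legend = NA,
                       inherit.aes = TRUE, minval = NULL, maxval = NULL, ...) {
  layer(
    data = data, mapping = mapping, stat = StatEcdf2, geom = geom,
    position = position, show.legend = show.legend, inherit.aes = inherit.aes,
    # Stat and geom parameters travel together in one list.
    params = list(n = n, minval = minval, maxval = maxval, ...)
  )
}

# Usage on hypothetical scores
toy <- data.frame(model_rank = c(0.2, 0.5, 0.5, 0.9))
p <- ggplot(toy, aes(x = model_rank)) + stat_ecdf2(minval = 0, maxval = 1)
```

If the installed version behaves differently, inspecting layer_data(p) should show whether the first and last x values really ended up at 0 and 1.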

As a plus, this gave me a chance to really dig into ggplot's internals, which is something that I've been meaning to do for a while.
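
For anyone who would rather not depend on ggplot2 internals, a possible workaround is to compute the ECDF points up front, add the endpoints by hand, and draw the result with geom_step(). This is only a sketch: padded_ecdf() is a hypothetical helper, and the toy data frame hits stands in for subset(df, target_outcome == 1).

```r
# Hypothetical helper: ECDF points for one group, with explicit endpoints.
padded_ecdf <- function(x, lower = 0, upper = 1) {
  xs <- sort(unique(x))
  data.frame(x = c(lower, xs, upper),
             y = c(0, ecdf(x)(xs), 1))
}

# Toy data standing in for subset(df, target_outcome == 1).
hits <- data.frame(
  model_rank = c(0.10, 0.30, 0.30, 0.80, 0.20, 0.60),
  obs_set    = rep(c("training", "validation"), each = 3)
)

# Build the padded curve separately for each observation set.
curves <- do.call(rbind, lapply(split(hits, hits$obs_set), function(g) {
  data.frame(obs_set = g$obs_set[1], padded_ecdf(g$model_rank))
}))

# Draw the precomputed steps when ggplot2 is available.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(curves, aes(x = x, y = y, colour = obs_set)) +
    geom_step() +
    labs(x = "Model Percentile", y = "Percent of Target Outcome")
}
```

The trade-off is that the ECDF is no longer computed by the plotting layer, so any faceting or regrouping has to be mirrored in the split() step.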



Source: https://stackoverflow.com/questions/28609547/in-r-ggplot2-include-stat-ecdf-endpoints-0-0-and-1-1
