Speed up the loop operation in R

说谎 2020-11-22 00:04

I have a big performance problem in R. I wrote a function that iterates over a data.frame object. It simply adds a new column to a data.frame and a

10 Answers
  •  生来不讨喜
    2020-11-22 00:17

    I dislike rewriting code. ifelse and lapply are of course better options, but sometimes it is difficult to make the problem fit them.

    I frequently use data.frames the way one would use lists, e.g. df$var[i].

    Here is a made-up example:

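    # Override nrow so it also works on a plain list of equal-length vectors
    # (needed because the loops below call nrow() on a list).
    nrow <- function(x) {
      if (is.list(x) && !is.data.frame(x)) {
        length(x[[1]])  # length of the first column vector
      } else {
        base::nrow(x)
      }
    }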
    
    system.time({
      d=data.frame(seq=1:10000,r=rnorm(10000))
      d$foo=d$r
      d$seq=1:5
      mark=NA
      for(i in 1:nrow(d)){
        if(d$seq[i]==1) mark=d$r[i]
        d$foo[i]=mark
      }
    })
    
    system.time({
      d=data.frame(seq=1:10000,r=rnorm(10000))
      d$foo=d$r
      d$seq=1:5
      d=as.list(d) #become a list
      mark=NA
      for(i in 1:nrow(d)){
        if(d$seq[i]==1) mark=d$r[i]
        d$foo[i]=mark
      }
      d=as.data.frame(d) #revert back to data.frame
    })
    

    data.frame version:

       user  system elapsed 
       0.53    0.00    0.53
    

    list version:

       user  system elapsed 
       0.04    0.00    0.03 
    

    Using a list of vectors is roughly 17 times faster than using a data.frame here (0.53 s vs. 0.03 s elapsed).
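
    To check that the speedup does not change the result, the same loop can be run on both representations and the outputs compared. This is only a sketch: the helper name carry_forward_loop, the seed, and the smaller row count are assumptions made for illustration and are not part of the original answer.

```r
# Hypothetical helper (not from the original answer): runs the same
# carry-forward loop on either a data.frame or a plain list of vectors.
carry_forward_loop <- function(d) {
  n <- length(d$seq)
  mark <- NA
  for (i in seq_len(n)) {
    if (d$seq[i] == 1) mark <- d$r[i]
    d$foo[i] <- mark
  }
  d
}

set.seed(42)  # arbitrary seed so the run is repeatable
base_df <- data.frame(seq = rep(1:5, length.out = 1000), r = rnorm(1000))
base_df$foo <- base_df$r

res_df   <- carry_forward_loop(base_df)                          # data.frame path
res_list <- as.data.frame(carry_forward_loop(as.list(base_df)))  # list path

# Compare the column that the loop fills in.
same_foo <- identical(res_df$foo, res_list$foo)
print(same_foo)
```

    Only the foo column is compared, since that is the only column the loop modifies.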

    Any comments on why data.frames are internally so slow in this regard? One would think they operate like lists...

    For even faster code, set class(d)='list' instead of calling d=as.list(d), and restore it afterwards with class(d)='data.frame' instead of calling as.data.frame(d):

    system.time({
      d=data.frame(seq=1:10000,r=rnorm(10000))
      d$foo=d$r
      d$seq=1:5
      class(d)='list'
      mark=NA
      for(i in 1:nrow(d)){
        if(d$seq[i]==1) mark=d$r[i]
        d$foo[i]=mark
      }
      class(d)='data.frame'
    })
    head(d)
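
    For this particular made-up loop, a fully vectorized form also exists, which is the kind of ifelse-style option mentioned at the top. The following is a minimal sketch, not part of the original answer: the variable names is_start and last_start and the seed are assumptions. It uses cummax to find, for each row, the index of the most recent row where seq == 1, and checks the result against the loop.

```r
set.seed(1)  # arbitrary seed, only to make the run repeatable
d <- data.frame(seq = rep(1:5, length.out = 10000), r = rnorm(10000))

# Loop version (on a plain list, as in the answer) used as the reference.
ref <- as.list(d)
ref$foo <- ref$r
mark <- NA
for (i in seq_along(ref$seq)) {
  if (ref$seq[i] == 1) mark <- ref$r[i]
  ref$foo[i] <- mark
}

# Vectorized version: for each row, find the index of the most recent row
# where seq == 1, then look up r at that index.
is_start <- d$seq == 1
last_start <- cummax(ifelse(is_start, seq_along(is_start), 0L))
last_start[last_start == 0L] <- NA  # rows before the first seq == 1 get NA
d$foo <- d$r[last_start]

vectorized_matches_loop <- identical(d$foo, ref$foo)
print(vectorized_matches_loop)
```

    This only works because each row depends on a single earlier row that can be located by index. Loops whose state depends on previously computed values in a more complicated way may not vectorize this easily, in which case the list trick above still applies.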
    
