How can I prevent rbind() from getting really slow as a data frame grows larger?

鱼传尺愫 2020-12-01 17:34

I have a data frame with only one row, and I add rows to it one at a time using rbind:

df #mydataframe with only one row
for (i in 1:20000)
{
    df<- rbind(df, n         


        
2 Answers
  • 2020-12-01 17:48

    You are in the 2nd circle of hell, namely failing to pre-allocate data structures.

    Growing objects in this fashion is a Very Very Bad Thing in R. Either pre-allocate and insert:

    df <- data.frame(x = rep(NA,20000),y = rep(NA,20000))
    

    or restructure your code to avoid this sort of incremental addition of rows. As discussed at the link I cite, the reason for the slowness is that each time you add a row, R needs to find a new contiguous block of memory to fit the data frame in. Lots 'o copying.
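
    As a minimal sketch of the two options above (not the original poster's code): the row count `n` and the row contents `x = i, y = i^2` are hypothetical stand-ins, since the question's loop body is truncated. The first loop fills a pre-allocated data frame in place; the second collects rows in a pre-allocated list and calls rbind once at the end.

    ```r
    n <- 1000  # hypothetical row count, kept small for illustration

    # Option 1: pre-allocate the full data frame, then fill rows in place.
    df <- data.frame(x = rep(NA_real_, n), y = rep(NA_real_, n))
    for (i in seq_len(n)) {
        df[i, ] <- list(i, i^2)  # stand-in row: x = i, y = i^2
    }

    # Option 2: collect rows in a pre-allocated list, then bind once.
    rows <- vector("list", n)
    for (i in seq_len(n)) {
        rows[[i]] <- data.frame(x = i, y = i^2)
    }
    df2 <- do.call(rbind, rows)
    ```

    Both versions allocate the container once up front. Whether in-place insertion is actually faster than repeated rbind depends on the data frame, so it is worth timing on your own data.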

  • 2020-12-01 17:48

    I tried an example. For what it's worth, it agrees with the user's assertion that inserting rows into the data frame is also really slow. I don't quite understand what's going on, as I would have expected the allocation problem to trump the speed of copying. Can anyone either replicate this, or explain why the results below (in elapsed time, a single rbind < appending < insertion) would be true in general, or explain why this is not a representative example (e.g. data frame too small)?

    edit: the first time around I forgot to initialize the object in hell2fun to a data frame, so the code was doing matrix operations rather than data frame operations, which are much faster. If I get a chance I'll extend the comparison to data frame vs. matrix. The qualitative assertions in the first paragraph hold, though.

    N <- 1000
    set.seed(101)
    r <- matrix(runif(2*N),ncol=2)
    
    ## second circle of hell
    hell2fun <- function() {
        df <- as.data.frame(rbind(r[1,])) ## initialize
        for (i in 2:N) {
            df <- rbind(df,r[i,])
        }
    }
    
    insertfun <- function() {
        df <- data.frame(x=rep(NA,N),y=rep(NA,N))
        for (i in 1:N) {
            df[i,] <- r[i,]
        }
    }
    
    rsplit <- as.list(as.data.frame(t(r)))
    rbindfun <-  function() {
        do.call(rbind,rsplit)
    }
    
    library(rbenchmark)
    benchmark(hell2fun(),insertfun(),rbindfun())
    
    ##          test replications elapsed relative user.self 
    ## 1  hell2fun()          100  32.439  484.164    31.778 
    ## 2 insertfun()          100  45.486  678.896    42.978 
    ## 3  rbindfun()          100   0.067    1.000     0.076 
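
    The edit above notes that matrix operations are much faster than data frame operations. A rough sketch of how that could be exploited, assuming all columns are numeric: fill a pre-allocated matrix row by row and convert to a data frame once at the end. `matrixfun` is a hypothetical helper, not part of the original benchmark, and no timings are claimed for it here.

    ```r
    N <- 1000
    set.seed(101)
    r <- matrix(runif(2 * N), ncol = 2)

    # Hypothetical helper: row-wise insertion into a pre-allocated matrix,
    # followed by a single conversion to a data frame.
    matrixfun <- function() {
        m <- matrix(NA_real_, nrow = N, ncol = 2,
                    dimnames = list(NULL, c("x", "y")))
        for (i in seq_len(N)) {
            m[i, ] <- r[i, ]
        }
        as.data.frame(m)
    }

    res <- matrixfun()
    # Time 100 replications with base R; results will vary by machine.
    system.time(for (k in 1:100) matrixfun())
    ```

    This only applies when every column shares one type, because a matrix cannot mix, say, numeric and character columns.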
    