How to speed up rbind?

暗喜 2020-12-05 07:49

I'm supposed to download a table from an MS-SQL server.

The number of rows is larger than 6 million, and the server cannot return all the data at once.

So, I wrote a loop that queries roughly 10,000 rows at a time and appends each chunk to a data frame with rbind(). It works, but the rbind() step gets very slow as the data frame grows. How can I speed this up?

3 Answers
  • 2020-12-05 08:09

    Here are a few options that I'm sure could be better:

    library(data.table)
    library(microbenchmark)
    
    #function to generate your data
    getData <- function(){
      data.frame(x=rnorm(10000),y=rnorm(10000),z=rnorm(10000))
    }
    
    #using data table's rbindlist each iteration
    fDT1 <- function(n){
      dat <- getData()
      for(i in 1:n){
        dat <- rbindlist(list(dat,getData()))
      }
      return(data.frame(dat))
    }
    
    #using data table's rbindlist all at once
    fDT2 <- function(n){
      return(data.frame(rbindlist(lapply(1:n,function(x) getData()))))
    }
    
    #pre-allocating a data frame
    fPre <- function(n){
      dat <- data.frame(x=rep(0,n*10000),y=rep(0,n*10000),z=rep(0,n*10000))
      j <- 1
      for(i in 1:n){
        dat[j:(j+10000-1),] <- getData()
        j <- j + 10000
      }
      return(dat)
    }
    
    #standard do.call rbind
    f2 <- function(n){
      return(do.call(rbind,lapply(1:n,function(x) getData())))
    }
    
    #current approach
    f <- function(n){
      dat <- getData()
      for(i in 1:n){
        dat <- rbind(dat,getData())
      }
      return(dat)
    }
    

    As you can see, using data.table's rbindlist() is a big improvement over base R's rbind(), and there is a big benefit in appending rows all at once instead of iteration by iteration (see the sketch after the timings below), though that may not be possible if there are memory concerns. You may also note that the speed improvements are nowhere near linear as the size of the data increases.

     > microbenchmark(fDT2(5),fDT1(5),fPre(5),f2(5),f(5),
    +                fDT2(25),fDT1(25),fPre(25),f2(25),f(25),
    +                fDT2(75),fDT1(75),fPre(75),f2(75),f(75),
    +                times=10)
    Unit: milliseconds
         expr        min         lq     median         uq         max neval
      fDT2(5)   18.31207   18.63969   24.09943   25.45590    72.01725    10
      fDT1(5)   27.65459   29.25147   36.34158   77.79446    88.82556    10
      fPre(5)   34.96257   39.39723   41.24445   43.30319    68.75897    10
        f2(5)   30.85883   33.00292   36.29100   43.53619    93.15869    10
         f(5)   87.40869   97.97500  134.50600  138.65354   147.67676    10
     fDT2(25)   89.42274   99.39819  103.90944  146.44160   156.01653    10
     fDT1(25)  224.65745  229.78129  261.52388  280.85499   300.93488    10
     fPre(25)  371.12569  412.79876  431.80571  485.37727  1046.96923    10
       f2(25)  221.03669  252.08998  265.17357  271.82414   281.47096    10
        f(25) 1446.32145 1481.01998 1491.59203 1634.99936  1849.00590    10
     fDT2(75)  326.66743  334.15669  367.83848  467.85480   520.27142    10
     fDT1(75) 1749.83842 1882.27091 2066.95241 2278.55589  2419.07205    10
     fPre(75) 3701.16220 3968.64643 4162.70585 4234.39716  4356.09462    10
       f2(75) 1174.47546 1183.98860 1314.64585 1421.09483  1537.42903    10
        f(75) 9139.36935 9349.24412 9510.90888 9977.24621 10861.51206    10
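
    If the number of chunks is not known ahead of time (so the lapply() pattern above is awkward), you can still get the "bind once" benefit by accumulating chunks in a list and calling rbindlist() a single time at the end. A minimal sketch; fetchChunk() is a hypothetical stand-in for whatever call returns your next block of rows (or NULL when the data is exhausted):

    library(data.table)
    
    chunks <- list()
    i <- 1
    repeat {
      chunk <- fetchChunk()      # hypothetical: next block of rows as a data.frame
      if (is.null(chunk)) break
      chunks[[i]] <- chunk       # growing a list is cheap; nothing is copied yet
      i <- i + 1
    }
    dat <- rbindlist(chunks)     # one bind at the end instead of one per iteration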
    
  • 2020-12-05 08:28

    As has been pointed out above, R stores all of its objects in RAM by default, so with that amount of data you are going to run into some problems.

    Two things I'd like to add: 1) Generally, if you don't want to use data.table, you can use the rbind.fill function in Hadley's plyr package, which is also quite fast. Never use rbind the way you did above, appending one piece at a time inside a 'for' loop: it forces R to copy the whole data frame object on every append, and that is slow.
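
    As a hedged sketch of that advice (rbind.fill is the real plyr function; getChunk() is just a placeholder for your own query code): collect the chunks first, then bind them in a single call.

    library(plyr)
    
    # Hypothetical getChunk(i): returns the i-th block of rows as a data.frame.
    chunks <- lapply(1:600, function(i) getChunk(i))
    
    # One call to rbind.fill on the whole list, instead of growing the
    # data frame inside a loop.
    dat <- rbind.fill(chunks)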

    2) To work with larger-than-RAM data in R, have a look at the section "Large memory and out-of-memory data" at http://cran.r-project.org/web/views/HighPerformanceComputing.html; maybe the bigmemory package is what you need.
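
    For illustration, here is a minimal sketch of the file-backed approach with bigmemory. It assumes all columns are numeric (a big.matrix holds a single atomic type) and uses a hypothetical fetchChunk() for the per-chunk query:

    library(bigmemory)
    
    n_rows <- 6e6    # e.g. from SELECT COUNT(*)
    n_cols <- 3
    
    # File-backed matrix: stored on disk, so it does not need to fit in RAM.
    bm <- filebacked.big.matrix(nrow = n_rows, ncol = n_cols, type = "double",
                                backingfile = "table.bin",
                                descriptorfile = "table.desc")
    
    j <- 1
    repeat {
      chunk <- fetchChunk()      # hypothetical: next 10k rows as a data.frame
      if (is.null(chunk) || nrow(chunk) == 0) break
      bm[j:(j + nrow(chunk) - 1), ] <- as.matrix(chunk)
      j <- j + nrow(chunk)
    }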

  • 2020-12-05 08:35

    Perhaps you could do SELECT COUNT(*) FROM table_name WHERE ... and then preallocate space for your data frame.

    Actually, I don't think querying your database 10,000 rows at a time is a good idea. Try to avoid that by exporting the data to a local disk and reading it from there; that will improve speed as well. Storage is cheap; network bandwidth and memory are not.
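
    A hedged sketch of the preallocation suggestion, using DBI's incremental fetch on a single query rather than one query per 10k rows (it assumes a DBI/odbc connection con already exists; the table and column names are placeholders):

    library(DBI)
    
    # con <- dbConnect(odbc::odbc(), ...)   # assumed to be set up already
    
    n <- dbGetQuery(con, "SELECT COUNT(*) AS n FROM table_name")$n
    
    # Preallocate once, then fill in place instead of rbind-ing each chunk.
    dat <- data.frame(x = numeric(n), y = numeric(n), z = numeric(n))
    
    res <- dbSendQuery(con, "SELECT x, y, z FROM table_name")
    j <- 1
    while (!dbHasCompleted(res)) {
      chunk <- dbFetch(res, n = 10000)
      if (nrow(chunk) == 0) break
      dat[j:(j + nrow(chunk) - 1), ] <- chunk
      j <- j + nrow(chunk)
    }
    dbClearResult(res)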
