How to append rows to an R data frame

太阳男子 2020-11-28 02:04

I have looked around StackOverflow, but I cannot find a solution specific to my problem, which involves appending rows to an R data frame.

I am initializing an empty

7 Answers
  •  孤街浪徒
    2020-11-28 02:21

    Suppose you simply don't know the size of the data.frame in advance. It could well be a few rows or a few million. You need some sort of container that grows dynamically. Taking into consideration my experience and all the related answers on SO, I came up with 4 distinct solutions:

    1. rbindlist the new rows into a data.table.

    2. Use data.table's fast set operation, coupled with manually doubling the table when needed.

    3. Use RSQLite and append to a table held in memory.

    4. Use data.frame's own ability to grow, with a custom environment (which has reference semantics) to store the data.frame so it is not copied on return.

    Here is a test of all the methods for both small and large number of appended rows. Each method has 3 functions associated with it:

    • create(first_element) that returns the appropriate backing object with first_element put in.

    • append(object, element) that appends the element to the end of the table (represented by object).

    • access(object) gets the data.frame with all the inserted elements.

    rbindlist into a data.table

    This one is quite easy and straightforward:

    library(data.table)  # provides as.data.table(), rbindlist(), set(), setattr()

    create.1<-function(elems)
    {
      return(as.data.table(elems))
    }
    
    append.1<-function(dt, elems)
    {
      return(rbindlist(list(dt, elems), use.names=TRUE))
    }
    
    access.1<-function(dt)
    {
      return(dt)
    }
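
    For comparison, the same three-function pattern can be written with base R's rbind alone (a sketch, not part of the benchmark; rbind copies the whole accumulated frame on every call, which is exactly why it gets slow for many appends):

```r
# Base-R analogue of the three helpers above (illustration only)
create.rbind <- function(elems) as.data.frame(elems)
append.rbind <- function(df, elems) rbind(df, as.data.frame(elems))  # full copy each time
access.rbind <- function(df) df

df <- create.rbind(list(a = 1, b = 2))
df <- append.rbind(df, list(a = 3, b = 4))
access.rbind(df)   # a data.frame with 2 rows and columns a, b
```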
    

    data.table::set + manually doubling the table when needed.

    I will store the true length of the table in a rowcount attribute.

    create.2<-function(elems)
    {
      return(as.data.table(elems))
    }
    
    append.2<-function(dt, elems)
    {
      n<-attr(dt, 'rowcount')
      if (is.null(n))
        n<-nrow(dt)
      if (n==nrow(dt))
      {
        # Table is full: append n NA rows at once, doubling the allocated size
        tmp<-elems[1]
        tmp[[1]]<-rep(NA,n)
        dt<-rbindlist(list(dt, tmp), fill=TRUE, use.names=TRUE)
        setattr(dt,'rowcount', n)
      }
      # Fill row n+1 in place, column by column, with data.table::set
      pos<-as.integer(match(names(elems), colnames(dt)))
      for (j in seq_along(pos))
      {
        set(dt, i=as.integer(n+1), pos[[j]], elems[[j]])
      }
      setattr(dt,'rowcount',n+1)
      return(dt)
    }
    
    access.2<-function(dt)
    {
      # Only the first 'rowcount' rows hold real data; the rest is preallocated slack
      n<-attr(dt, 'rowcount')
      return(as.data.table(dt[1:n,]))
    }
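
    The doubling trick can be illustrated in base R alone: grow a plain vector by doubling its allocation whenever it fills up, and count how many times the data actually gets reallocated. This is a hypothetical helper, not the author's code; it just shows why the amortized cost per append is O(1):

```r
# Append n values one by one, doubling capacity when full;
# 'copies' counts reallocations, which grows as O(log n), not O(n)
grow_demo <- function(n) {
  buf <- rep(NA_real_, 1)   # start with capacity 1
  len <- 0
  copies <- 0
  for (i in 1:n) {
    if (len == length(buf)) {
      buf <- c(buf, rep(NA_real_, length(buf)))  # double the allocation
      copies <- copies + 1
    }
    len <- len + 1
    buf[len] <- i
  }
  list(values = buf[1:len], copies = copies)
}
```

    For n = 1000 this performs only 10 reallocations (capacity 1, 2, 4, ..., 1024), whereas growing by one slot per append would reallocate about 1000 times.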
    

    SQL should be optimized for fast record insertion, so I initially had high hopes for the RSQLite solution.

    This is basically a copy-and-paste of Karsten W.'s answer from a similar thread.

    create.3<-function(elems)
    {
      con <- RSQLite::dbConnect(RSQLite::SQLite(), ":memory:")
      RSQLite::dbWriteTable(con, 't', as.data.frame(elems))
      return(con)
    }
    
    append.3<-function(con, elems)
    { 
      RSQLite::dbWriteTable(con, 't', as.data.frame(elems), append=TRUE)
      return(con)
    }
    
    access.3<-function(con)
    {
      return(RSQLite::dbReadTable(con, "t", row.names=NULL))
    }
    

    data.frame's own row-appending + custom environment.

    create.4<-function(elems)
    {
      env<-new.env()
      env$dt<-as.data.frame(elems)
      return(env)
    }
    
    append.4<-function(env, elems)
    { 
      env$dt[nrow(env$dt)+1,]<-elems
      return(env)
    }
    
    access.4<-function(env)
    {
      return(env$dt)
    }
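
    Why the environment matters here: unlike data.frames, environments are passed by reference, so append.4 can modify env$dt without returning a copy of the whole frame to the caller. A minimal base-R illustration with hypothetical names:

```r
bump <- function(e) {
  e$x <- e$x + 1   # mutates the caller's environment; nothing is copied back
  invisible(NULL)
}

counter_env <- new.env()
counter_env$x <- 0
bump(counter_env)
bump(counter_env)
counter_env$x   # 2: both calls modified the same underlying object
```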
    

    The test suite:

    For convenience, I use one test function to cover them all via indirect calling. (I checked: using do.call instead of calling the functions directly doesn't make the code run measurably longer.)

    test<-function(id, n=1000)
    {
      n<-n-1
      el<-list(a=1,b=2,c=3,d=4)
      o<-do.call(paste0('create.',id),list(el))
      s<-paste0('append.',id)
      for (i in 1:n)
      {
        o<-do.call(s,list(o,el))
      }
      return(do.call(paste0('access.', id), list(o)))
    }
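
    The indirect-call mechanism in isolation: do.call(paste0('append.', id), list(...)) builds a function name as a string and invokes it with a list of arguments. A self-contained sketch with a hypothetical function name:

```r
append.demo <- function(x, e) c(x, e)

id <- 'demo'
f <- paste0('append.', id)     # "append.demo"
o <- do.call(f, list(1:3, 4))  # equivalent to append.demo(1:3, 4)
o                              # 1 2 3 4
```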
    

    Let's see the performance for n=10 insertions.

    I also added 'placebo' functions (with suffix 0) that don't do anything, just to measure the overhead of the test setup.

    library(microbenchmark)
    library(ggplot2)  # autoplot() for microbenchmark results

    r<-microbenchmark(test(0,n=10), test(1,n=10), test(2,n=10), test(3,n=10), test(4,n=10))
    autoplot(r)
    

    For 1E5 rows (measurements done on Intel(R) Core(TM) i7-4710HQ CPU @ 2.50GHz):

    nr  function      time
    4   data.frame    228.251 
    3   sqlite        133.716
    2   data.table      3.059
    1   rbindlist     169.998 
    0   placebo         0.202
    

    It looks like the SQLite-based solution, although it regains some speed on large data, is nowhere near data.table + manual exponential growth. The difference is almost two orders of magnitude!

    Summary

    If you know that you will append a rather small number of rows (n<=100), go ahead and use the simplest possible solution: just assign rows to the data.frame using bracket notation, and ignore the fact that the data.frame is not pre-populated.

    For everything else use data.table::set and grow the data.table exponentially (e.g. using my code).
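
    For the small-n case, the "simplest possible solution" mentioned above looks like this in base R (a sketch; fine for a handful of rows, slow for many):

```r
# Start from an empty data.frame and grow it row by row via bracket assignment
df <- data.frame(a = numeric(0), b = numeric(0))
for (i in 1:5) {
  # Assigning past the last row extends the data.frame
  df[nrow(df) + 1, ] <- list(a = i, b = i^2)
}
nrow(df)   # 5
```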
