Trouble converting long list of data.frames (~1 million) to single data.frame using do.call and ldply

忘了有多久 2020-12-02 19:28

I know there are many questions here on SO about ways to convert a list of data.frames to a single data.frame using do.call or ldply, but this question is about understandi

4 Answers
  •  孤街浪徒
    2020-12-02 19:29

    rbind.data.frame does a lot of checking you don't need. This should be a pretty quick transformation if you only do exactly what you want.

    # Use data from Josh O'Brien's post.
    set.seed(21)
    X <- replicate(50000, data.frame(a=rnorm(5), b=1:5), simplify=FALSE)
    system.time({
      Names <- names(X[[1]])  # Get data.frame names from first list element.
      # For each name, extract its values from each data.frame in the list.
      # This provides a list with an element for each name.
      Xb <- lapply(Names, function(x) unlist(lapply(X, `[[`, x)))
      names(Xb) <- Names          # Give Xb the correct names.
      Xb.df <- as.data.frame(Xb)  # Convert Xb to a data.frame.
    })
    #    user  system elapsed 
    #   3.356   0.024   3.388 
    system.time(X1 <- do.call(rbind, X))
    #    user  system elapsed 
    # 169.627   6.680 179.675
    identical(X1,Xb.df)
    # [1] TRUE
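
    The column-wise idea above may be easier to follow on a tiny input. The following is only a sketch: `X_small` and `bind_columns` are hypothetical names made up for illustration, and it assumes every data.frame in the list has the same column names and compatible column types.

    ```r
    # Hypothetical toy input: three small data.frames with identical columns.
    X_small <- list(
      data.frame(a = c(1.5, 2.5), b = 1:2),
      data.frame(a = c(3.5, 4.5), b = 3:4),
      data.frame(a = c(5.5, 6.5), b = 5:6)
    )

    # bind_columns() is an illustrative helper, not a base R function.
    # Assumption: all elements share the column names of the first element.
    bind_columns <- function(dfs) {
      col_names <- names(dfs[[1]])
      # For each column name, pull that column out of every data.frame
      # and concatenate the pieces into one long vector.
      cols <- lapply(col_names, function(nm) unlist(lapply(dfs, `[[`, nm)))
      names(cols) <- col_names
      # Build the data.frame once, instead of growing it row block by row block.
      as.data.frame(cols)
    }

    small_df <- bind_columns(X_small)
    str(small_df)
    ```

    The point of the design is that the data.frame is constructed a single time from complete columns, so none of the per-call checks in rbind.data.frame are repeated for every list element.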
    

    Inspired by the data.table answer, I decided to try and make this even faster. Here's my updated solution, to try and keep the check mark. ;-)

    # My "rbind list" function
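    rbl.ju <- function(x) {
      # Flatten the list of data.frames into one long list of columns.
      u <- unlist(x, recursive=FALSE)
      n <- names(u)
      un <- unique(n)
      # For each column name, concatenate all matching pieces (names dropped).
      l <- lapply(un, function(N) unlist(u[N==n], FALSE, FALSE))
      names(l) <- un
      as.data.frame(l)  # Return the result visibly instead of via assignment.
    }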
    # simple wrapper to rbindlist that returns a data.frame
    rbl.dt <- function(x) {
      as.data.frame(rbindlist(x))
    }
    
    library(data.table)
    if(packageVersion("data.table") >= '1.8.2') {
      system.time(dt <- rbl.dt(X))  # rbindlist only exists in recent versions
    }
    #    user  system elapsed 
    #    0.02    0.00    0.02
    system.time(ju <- rbl.ju(X))
    #    user  system elapsed 
    #    0.05    0.00    0.05 
    identical(dt,ju)
    # [1] TRUE
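
    To see why rbl.ju works, it may help to look at what `unlist(x, recursive=FALSE)` produces on a small input. This is a sketch with a hypothetical two-element list `X_small`; it assumes the outer list is unnamed, as it is when built with `replicate(..., simplify=FALSE)`. If the outer list carried names, `unlist` would prefix them onto the column names and the matching on `names(u)` would no longer group the columns.

    ```r
    # Hypothetical toy input: an unnamed list of two data.frames.
    X_small <- list(
      data.frame(a = c(1.5, 2.5), b = 1:2),
      data.frame(a = c(3.5, 4.5), b = 3:4)
    )

    # One level of flattening turns the list of data.frames into a list of
    # columns; the column names repeat once per original data.frame.
    u <- unlist(X_small, recursive = FALSE)
    n <- names(u)
    print(n)

    # Selecting by name gathers every piece of one column, and a second
    # unlist() (non-recursive, without names) concatenates the pieces.
    a_all <- unlist(u[n == "a"], FALSE, FALSE)
    b_all <- unlist(u[n == "b"], FALSE, FALSE)
    print(a_all)
    print(b_all)
    ```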
    
