What methods can we use to reshape VERY large data sets?


When calculations on very large data sets take a long time and we want to avoid crashes, it would be valuable to know beforehand which reshape method to use.

1 Answer

    If your real data is as regular as your sample data, we can be quite efficient by noticing that reshaping a matrix is really just changing its dim attribute.
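
    For instance, a minimal illustration of that point (not part of the benchmark below):

    # Changing dim() just updates an attribute; the elements themselves
    # stay in the same order in memory.
    x <- 1:12
    dim(x) <- c(3, 4)   # x is now a 3x4 matrix, filled column-wise
    dim(x) <- c(4, 3)   # reshaped again, same underlying data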

    First, on very small data:

    library(data.table)
    library(microbenchmark)
    library(tidyr)
    
    matrix_spread <- function(df1, key, value){
      # Assumes rows are sorted with `key` varying fastest, and that every
      # key value is present for every tms value (a balanced panel).
      unique_ids <- unique(df1[[key]])
      # Pour the value column into a matrix: one row per tms, one column per key value.
      mat <- matrix(df1[[value]], ncol = length(unique_ids), byrow = TRUE)
      df2 <- data.frame(unique(df1["tms"]), mat)
      names(df2)[-1] <- paste0(value, ".", unique_ids)
      df2
    }
    
    n <- 3      
    t1 <- 4
    df1 <- expand.grid(id=1:n, tms=as.POSIXct(1:t1, origin="1970-01-01"))
    df1$y <- rnorm(nrow(df1))
    
    reshape(df1, idvar="tms", timevar="id", direction="wide")
    #                    tms        y.1        y.2       y.3
    # 1  1970-01-01 01:00:01  0.3518667  0.6350398 0.1624978
    # 4  1970-01-01 01:00:02  0.3404974 -1.1023521 0.5699476
    # 7  1970-01-01 01:00:03 -0.4142585  0.8194931 1.3857788
    # 10 1970-01-01 01:00:04  0.3651138 -0.9867506 1.0920621
    
    matrix_spread(df1, "id", "y")
    #                    tms        y.1        y.2       y.3
    # 1  1970-01-01 01:00:01  0.3518667  0.6350398 0.1624978
    # 4  1970-01-01 01:00:02  0.3404974 -1.1023521 0.5699476
    # 7  1970-01-01 01:00:03 -0.4142585  0.8194931 1.3857788
    # 10 1970-01-01 01:00:04  0.3651138 -0.9867506 1.0920621
    
    all.equal(check.attributes = FALSE,
              reshape(df1, idvar="tms", timevar="id", direction="wide"),
              matrix_spread(df1, "id", "y"))
    # TRUE
    

    Then, on bigger data (sorry, I can't afford to run a truly huge computation right now):

    n <- 100      
    t1 <- 5000
    
    df1 <- expand.grid(id=1:n, tms=as.POSIXct(1:t1, origin="1970-01-01"))
    df1$y <- rnorm(nrow(df1))
    
    DT1 <- as.data.table(df1)
    
    microbenchmark(reshape=reshape(df1, idvar="tms", timevar="id", direction="wide"),
                   dcast=dcast(df1, tms ~ id, value.var="y"),
                   dcast.dt=dcast(DT1, tms ~ id, value.var="y"),
                   tidyr=spread(df1, id, y),
                   matrix_spread = matrix_spread(df1, "id", "y"),
                   times=3L)
    
    # Unit: milliseconds
    # expr                 min         lq       mean     median         uq        max neval
    # reshape       4197.08012 4240.59316 4260.58806 4284.10620 4292.34203 4300.57786     3
    # dcast           57.31247   78.16116   86.93874   99.00986  101.75189  104.49391     3
    # dcast.dt       114.66574  120.19246  127.51567  125.71919  133.94064  142.16209     3
    # tidyr           55.12626   63.91142   72.52421   72.69658   81.22319   89.74980     3
    # matrix_spread   15.00522   15.42655   17.45283   15.84788   18.67664   21.50539     3 
    

    Not too bad! matrix_spread wins because it fills a single preallocated matrix in one pass, with none of the per-group matching and bookkeeping the general reshapers do.
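
    Since the question is ultimately about not crashing, allocations matter as much as speed. Here is a minimal sketch for comparing memory, assuming the bench package is installed (its mem_alloc column reports how much each expression allocates):

    library(bench)
    # check = FALSE because the two results differ in class and attributes
    mark(
      reshape       = reshape(df1, idvar = "tms", timevar = "id", direction = "wide"),
      matrix_spread = matrix_spread(df1, "id", "y"),
      check = FALSE
    )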

    Regarding memory usage, my guess is that if reshape can handle your data, my solution will too, provided you can work with my assumptions or preprocess the data to meet them (a preprocessing sketch follows the list):

    • the data is sorted by tms, with id varying fastest
    • there are only 3 columns (id, tms, and the value)
    • every id value is present for every tms value (the panel is balanced)
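
    A minimal sketch of such a preprocessing step (prepare_for_matrix_spread is a hypothetical helper, not from the original answer):

    prepare_for_matrix_spread <- function(df, key = "id", time = "tms", value = "y") {
      # keep only the three needed columns
      df <- df[, c(time, key, value)]
      # sort by time, with the key varying fastest within each time point
      df <- df[order(df[[time]], df[[key]]), ]
      # fail early if the panel is not balanced
      stopifnot(nrow(df) == length(unique(df[[key]])) * length(unique(df[[time]])))
      df
    }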