What methods can we use to reshape VERY large data sets?


When calculations on very large data sets take a long time and we want to avoid crashes, it would be valuable to know beforehand which reshape method to use.

1 Answer

    If your real data is as regular as your sample data, we can be quite efficient by noticing that reshaping a matrix is really just changing its dim attribute.
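
    For instance, a minimal illustration of that point (not part of the benchmark below):

    # Changing dim() just updates an attribute; the elements themselves
    # stay in the same order in memory.
    x <- 1:12
    dim(x) <- c(3, 4)   # x is now a 3x4 matrix, filled column-wise
    dim(x) <- c(4, 3)   # reshaped again, same underlying data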

    First, on very small data:

    library(data.table)
    library(microbenchmark)
    library(tidyr)
    
    matrix_spread <- function(df1, key, value){
      # Assumes rows are sorted with `key` varying fastest, and that every
      # key value is present for every tms value (a balanced panel).
      unique_ids <- unique(df1[[key]])
      # Pour the value column into a matrix: one row per tms, one column per key value.
      mat <- matrix(df1[[value]], ncol = length(unique_ids), byrow = TRUE)
      df2 <- data.frame(unique(df1["tms"]), mat)
      names(df2)[-1] <- paste0(value, ".", unique_ids)
      df2
    }
    
    n <- 3      
    t1 <- 4
    df1 <- expand.grid(id=1:n, tms=as.POSIXct(1:t1, origin="1970-01-01"))
    df1$y <- rnorm(nrow(df1))
    
    reshape(df1, idvar="tms", timevar="id", direction="wide")
    #                    tms        y.1        y.2       y.3
    # 1  1970-01-01 01:00:01  0.3518667  0.6350398 0.1624978
    # 4  1970-01-01 01:00:02  0.3404974 -1.1023521 0.5699476
    # 7  1970-01-01 01:00:03 -0.4142585  0.8194931 1.3857788
    # 10 1970-01-01 01:00:04  0.3651138 -0.9867506 1.0920621
    
    matrix_spread(df1, "id", "y")
    #                    tms        y.1        y.2       y.3
    # 1  1970-01-01 01:00:01  0.3518667  0.6350398 0.1624978
    # 4  1970-01-01 01:00:02  0.3404974 -1.1023521 0.5699476
    # 7  1970-01-01 01:00:03 -0.4142585  0.8194931 1.3857788
    # 10 1970-01-01 01:00:04  0.3651138 -0.9867506 1.0920621
    
    all.equal(check.attributes = FALSE,
              reshape(df1, idvar="tms", timevar="id", direction="wide"),
              matrix_spread(df1, "id", "y"))
    # TRUE
    

    Then, on bigger data (sorry, I can't afford to run a truly huge computation right now):

    n <- 100      
    t1 <- 5000
    
    df1 <- expand.grid(id=1:n, tms=as.POSIXct(1:t1, origin="1970-01-01"))
    df1$y <- rnorm(nrow(df1))
    
    DT1 <- as.data.table(df1)
    
    microbenchmark(reshape=reshape(df1, idvar="tms", timevar="id", direction="wide"),
                   dcast=dcast(df1, tms ~ id, value.var="y"),
                   dcast.dt=dcast(DT1, tms ~ id, value.var="y"),
                   tidyr=spread(df1, id, y),
                   matrix_spread = matrix_spread(df1, "id", "y"),
                   times=3L)
    
    # Unit: milliseconds
    # expr                 min         lq       mean     median         uq        max neval
    # reshape       4197.08012 4240.59316 4260.58806 4284.10620 4292.34203 4300.57786     3
    # dcast           57.31247   78.16116   86.93874   99.00986  101.75189  104.49391     3
    # dcast.dt       114.66574  120.19246  127.51567  125.71919  133.94064  142.16209     3
    # tidyr           55.12626   63.91142   72.52421   72.69658   81.22319   89.74980     3
    # matrix_spread   15.00522   15.42655   17.45283   15.84788   18.67664   21.50539     3 
    

    Not too bad! matrix_spread wins because it fills a single preallocated matrix in one pass, with none of the per-group matching and bookkeeping the general reshapers do.
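
    Since the question is ultimately about not crashing, allocations matter as much as speed. Here is a minimal sketch for comparing memory, assuming the bench package is installed (its mem_alloc column reports how much each expression allocates):

    library(bench)
    # check = FALSE because the two results differ in class and attributes
    mark(
      reshape       = reshape(df1, idvar = "tms", timevar = "id", direction = "wide"),
      matrix_spread = matrix_spread(df1, "id", "y"),
      check = FALSE
    )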

    Regarding memory usage, my guess is that if reshape can handle your data, my solution will too, provided you can work with my assumptions or preprocess the data to meet them (a preprocessing sketch follows the list):

    • the data is sorted by tms, with id varying fastest
    • there are only 3 columns (id, tms, and the value)
    • every id value is present for every tms value (the panel is balanced)
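
    A minimal sketch of such a preprocessing step (prepare_for_matrix_spread is a hypothetical helper, not from the original answer):

    prepare_for_matrix_spread <- function(df, key = "id", time = "tms", value = "y") {
      # keep only the three needed columns
      df <- df[, c(time, key, value)]
      # sort by time, with the key varying fastest within each time point
      df <- df[order(df[[time]], df[[key]]), ]
      # fail early if the panel is not balanced
      stopifnot(nrow(df) == length(unique(df[[key]])) * length(unique(df[[time]])))
      df
    }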