Fastest way to read large Excel xlsx files? To parallelize or not?

Asked by 梦毁少年i, 2020-12-18 00:24

My question is:

  • What is the fastest way to read large(ish) .xlsx Excel files into R? Files range from 10 to 200 MB, with multiple sheets.

2 Answers
  •  自闭症患者
    2020-12-18 00:59

    I saw @clemens' answer, but since I had already prepared some material, I am posting mine anyway. Building on @clemens' answer, I use larger test data and try a simpler multicore option with furrr::future_map(), which turns out to give no performance gain in the end...

    Data generation

    This will create 10 sheets, each with 10,000 rows × 15 columns mixing float, integer, and character data. On my disk, the file size is 13.2 MB.

    library(writexl)
    library(tidyverse)
    
    n <- 1e4
    # 10 sheets; each sheet column-binds 5 blocks of (numeric, integer, character)
    sample_data <- map(seq(10), function(sheet) {
      map(1:5, function(i) {
        tibble(  # data_frame() is deprecated in favour of tibble()
          num_var = rnorm(n),
          int_var = as.integer(sample(1e5:9e5, n, replace = TRUE)),
          char_var = sample(letters, n, replace = TRUE)
        ) %>% rename_all(~ paste0(., i))  # funs() is deprecated; use a lambda
      }) %>% bind_cols()
    })
    # note: tmpdir must already exist; plain tempfile() would use the session tempdir
    fn <- tempfile(tmpdir = "~/Desktop/temp", fileext = ".xlsx")
    write_xlsx(sample_data, path = fn)
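
    A quick sanity check on the generated workbook (optional; uses the fn created above):
    
    readxl::excel_sheets(fn)  # should list the 10 sheet names
    file.size(fn) / 1e6       # size on disk in MB (~13 MB here)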
    

    Benchmark

    The parallel part is borrowed from @clemens.

    library(parallel)
    library(readxl)
    
    sheets <- excel_sheets(fn)
    
    cl <- makeCluster(detectCores() - 1)
    excel_path <- fn
    
    microbenchmark::microbenchmark(
      map = map(sheets, function(x) read_xlsx(fn, sheet = x)),
      future_map = furrr::future_map(sheets, function(x) read_xlsx(fn, sheet = x)),
      parLapplyLB = parLapplyLB(cl, sheets, function(sheet, excel_path) {
        readxl::read_xlsx(excel_path, sheet = sheet)
      }, excel_path),
      times = 10
    )
    
    stopCluster(cl)  # shut down the worker processes when done
    

    The benchmark result looks like this:

    Unit: milliseconds
            expr       min        lq      mean    median       uq      max neval
             map 1258.2643 1272.2354 1351.8371 1291.2474 1430.211 1556.992    10
      future_map 1276.4125 1302.2022 1469.8349 1436.5356 1616.146 1702.494    10
     parLapplyLB  809.2697  863.3299  951.1041  914.9503 1014.907 1189.897    10
    
    

    I have a relatively weak CPU, so the gain should be larger in other environments; but in the end, the SQL-writing step would probably be the bottleneck, since reading with read_xlsx is already very fast.
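
    One plausible reason future_map shows no gain above: furrr runs on a sequential plan unless a parallel backend is registered with future::plan(). A minimal sketch (assuming the same fn and sheets as above; not re-benchmarked here):
    
    library(furrr)
    library(future)
    
    # without an explicit plan, future_map() falls back to sequential execution
    plan(multisession, workers = availableCores() - 1)
    
    res <- future_map(sheets, ~ readxl::read_xlsx(fn, sheet = .x))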

    Note

    I also tried other packages such as gdata and xlsx. These are painfully slow, so they are not worth considering.
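
    If you want to compare yet another reader, openxlsx is a common alternative that can be dropped into the same kind of benchmark (just a sketch, not measured here; note that openxlsx::read.xlsx returns plain data frames rather than tibbles):
    
    microbenchmark::microbenchmark(
      readxl   = lapply(sheets, function(x) readxl::read_xlsx(fn, sheet = x)),
      openxlsx = lapply(sheets, function(x) openxlsx::read.xlsx(fn, sheet = x)),
      times = 10
    )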
