Fastest way to read large Excel xlsx files? To parallelize or not?

梦毁少年i · 2020-12-18 00:24

My question is:

  • What is the fastest way to read large(ish) .xlsx Excel files (10 to 200 MB, with multiple sheets) into R?

2 Answers
  • 2020-12-18 00:59

    I saw @clemens' answer, but since I had already prepared some material, I am posting it anyway. In addition to @clemens' answer, I use a bigger test dataset and run a simpler multicore option using furrr::future_map(), which in the end gives no performance gain at all...
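
    (An aside, and an assumption of mine rather than something stated in this thread: furrr::future_map() obeys the current future plan, and the default plan is sequential, so without a future::plan() call it behaves like a plain map(). A minimal sketch of enabling real parallelism, with a hypothetical file path:)

    library(future)
    library(furrr)
    library(readxl)
    
    plan(multisession, workers = availableCores() - 1)   # launch background R sessions
    path <- "big_file.xlsx"                              # hypothetical path
    sheets <- excel_sheets(path)
    res <- future_map(sheets, ~ read_xlsx(path, sheet = .x))
    plan(sequential)                                     # release the workers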

    Data generation

    This will create 10 sheets of 10,000 × 15 data (rows × columns), with a mixture of double, integer, and character columns. On my disk, the file size is 13.2 MB.

    library(writexl)
    library(tidyverse)
    
    n <- 1e4
    # 10 sheets; each sheet binds 5 blocks of 3 columns (double, integer, character)
    sample_data <- map(seq(10), function(sheet_id) {
      map(1:5, function(block_id) {
        tibble(
          num_var  = rnorm(n),
          int_var  = as.integer(sample(1e5:9e5, n, replace = TRUE)),
          char_var = sample(letters, n, replace = TRUE)
        ) %>% rename_with(~ paste0(.x, block_id))
      }) %>% bind_cols()
    })
    # note: the tmpdir directory must already exist
    fn <- tempfile(tmpdir = "~/Desktop/temp", fileext = ".xlsx")
    write_xlsx(sample_data, path = fn)
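
    (To double-check the size on disk, a one-liner of my own:)

    file.size(fn) / 1e6   # file size in MB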
    

    Benchmark

    The parallel part is borrowed from @clemens' answer.

    library(parallel)
    library(readxl)
    
    sheets <- excel_sheets(fn)
    
    cl <- makeCluster(detectCores() - 1)
    excel_path <- fn
    
    microbenchmark::microbenchmark(
      map = map(sheets, function(x) read_xlsx(fn, sheet = x)),
      # no future::plan() is set here, so future_map() runs on the default
      # sequential plan -- which likely explains why it shows no gain below
      future_map = furrr::future_map(sheets, function(x) read_xlsx(fn, sheet = x)),
      parLapplyLB = {parLapplyLB(cl, sheets, function(sheet, excel_path) {
        readxl::read_xlsx(excel_path, sheet = sheet)
      }, excel_path)},
      times = 10
    )
    
    stopCluster(cl)   # release the workers when done
    

    The benchmark result looks like this:

    Unit: milliseconds
            expr       min        lq      mean    median       uq      max neval
             map 1258.2643 1272.2354 1351.8371 1291.2474 1430.211 1556.992    10
      future_map 1276.4125 1302.2022 1469.8349 1436.5356 1616.146 1702.494    10
     parLapplyLB  809.2697  863.3299  951.1041  914.9503 1014.907 1189.897    10
    
    

    I have a relatively weak CPU, so the gain should be bigger in other environments; in the end, though, the step that writes the data to SQL will probably be the bottleneck, since read_xlsx is already really fast at reading.
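
    (The SQL step is not shown anywhere in this thread, so purely as a hypothetical sketch: writing each sheet into a local SQLite database with DBI/RSQLite could look like this.)

    library(DBI)
    library(RSQLite)
    
    con <- dbConnect(SQLite(), "sheets.db")   # hypothetical target database
    sheet_data <- lapply(sheets, function(s) readxl::read_xlsx(fn, sheet = s))
    names(sheet_data) <- sheets
    for (s in sheets) {
      # one table per sheet; overwrite any existing table of the same name
      dbWriteTable(con, name = s, value = sheet_data[[s]], overwrite = TRUE)
    }
    dbDisconnect(con)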

    Note

    I also tried other packages such as gdata and xlsx. They are painfully slow, so not worth considering.

  • 2020-12-18 01:00 (the @clemens answer referenced above)

    You could try running it in parallel using the parallel package, but it is hard to estimate how fast this will be without sample data:

    library(parallel)
    library(readxl)
    
    excel_path <- ""   # set this to the path of your .xlsx file
    sheets <- excel_sheets(excel_path)
    

    Make a cluster with a specified number of cores:

    cl <- makeCluster(detectCores() - 1)   # leave one core free for the rest of the system
    

    Use parLapplyLB to go through the Excel sheets and read them in parallel with load balancing:

    parLapplyLB(cl, sheets, function(sheet, excel_path) {
      readxl::read_excel(excel_path, sheet = sheet)
    }, excel_path)
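
    parLapplyLB returns an unnamed list in sheet order; naming it makes individual sheets easier to pick out later (my addition, not part of the original answer):

    sheet_data <- parLapplyLB(cl, sheets, function(sheet, excel_path) {
      readxl::read_excel(excel_path, sheet = sheet)
    }, excel_path)
    names(sheet_data) <- sheets   # e.g. sheet_data[["Sheet1"]]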
    

    You can use the microbenchmark package to test how fast the different options are:

    library(microbenchmark)
    
    microbenchmark(
      lapply = {lapply(sheets, function(sheet) {
        read_excel(excel_path, sheet = sheet)
      })},
      parallel = {parLapplyLB(cl, sheets, function(sheet, excel_path) {
        readxl::read_excel(excel_path, sheet = sheet)
      }, excel_path)},
      times = 10
    )
    

    In my case, the parallel version is faster:

    Unit: milliseconds
         expr       min        lq     mean    median        uq      max neval
       lapply 133.44857 167.61801 179.0888 179.84616 194.35048 226.6890    10
     parallel  58.94018  64.96452 118.5969  71.42688  80.48588 316.9914    10
    

    The test file consists of 6 sheets, each containing this table:

        test test1 test3 test4 test5
     1     1     1     1     1     1
     2     2     2     2     2     2
     3     3     3     3     3     3
     4     4     4     4     4     4
     5     5     5     5     5     5
     6     6     6     6     6     6
     7     7     7     7     7     7
     8     8     8     8     8     8
     9     9     9     9     9     9
    10    10    10    10    10    10
    11    11    11    11    11    11
    12    12    12    12    12    12
    13    13    13    13    13    13
    14    14    14    14    14    14
    15    15    15    15    15    15
    

    Note: you can use stopCluster(cl) to shut down the workers when the process is finished.
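
    Putting the pieces together, a small self-contained wrapper (my own sketch, not from the answer) that always releases the workers, even if reading fails:

    read_xlsx_parallel <- function(excel_path, workers = parallel::detectCores() - 1) {
      cl <- parallel::makeCluster(workers)
      on.exit(parallel::stopCluster(cl), add = TRUE)   # guaranteed cleanup
      sheets <- readxl::excel_sheets(excel_path)
      out <- parallel::parLapplyLB(cl, sheets, function(sheet, excel_path) {
        readxl::read_excel(excel_path, sheet = sheet)
      }, excel_path)
      names(out) <- sheets
      out
    }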
