Fast way to read xlsx files into R

社会主义新天地 提交于 2019-12-18 19:14:39

问题


It is a follow-up question to this one. What is the fastest way to read .xlsx files into R ?

I use library(xlsx) to read in data from 36 .xlsx files. It works. However, the problem is that this is very time consuming (well over 30 minutes), especially when considering the data in each file is not that large (matrix of size 3*3652 in each file). To this end, is there a better to deal with such problem, please? Is there another quick way to read .xlsx into R? Or can I put the 36 files into a single csv file quickly and then read into R?

Moreover, I just realised that readxl cannot write xlsx. Is there a counterpart of it to deal with writing instead of reading?

"Response to those voted this question down":

This question is about fact instead of the so-called "opinionated answers and spam" because speed is time and time is fact but NOT opinion.

Further update:

Perhaps one can explain to us in plain language why some method works much faster than others. I am certainly confused about this.


回答1:


Here is a small benchmark test. Results: readxl::read_xlsx on average about twice as fast as openxlsx::read.xlsx across different number of rows (n) and columns (p) using standard settings.

options(scipen=999)  # no scientific number format

nn <- c(1, 10, 100, 1000, 5000, 10000, 20000, 30000)
pp <- c(1, 5, 10, 20, 30, 40, 50)

# create some excel files
l <- list()  # save results
tmp_dir <- tempdir()

for (n in nn) {
  for (p in pp) {
    name <-  
    cat("\n\tn:", n, "p:", p)
    flush.console()
    m <- matrix(rnorm(n*p), n, p)
    file <- paste0(tmp_dir, "/n", n, "_p", p, ".xlsx")

    # write
    write.xlsx(m, file)

    # read
    elapsed <- system.time( x <- openxlsx::read.xlsx(file) )["elapsed"]
    df <- data.frame(fun = "openxlsx::read.xlsx", n = n, p = p, 
                     elapsed = elapsed, stringsAsFactors = F, row.names = NULL)
    l <- append(l, list(df))

    elapsed <- system.time( x <- readxl::read_xlsx(file) )["elapsed"]
    df <- data.frame(fun = "readxl::read_xlsx", n = n, p = p, 
                     elapsed = elapsed, stringsAsFactors = F, row.names = NULL)
    l <- append(l, list(df))

  }
}

# results 
d <- do.call(rbind, l)

library(ggplot2)

ggplot(d, aes(n, elapsed, color= fun)) + 
  geom_line() + geom_point() +  
  facet_wrap( ~ paste("columns:", p)) +
  xlab("Number of rows") +
  ylab("Seconds")


来源:https://stackoverflow.com/questions/44538199/fast-way-to-read-xlsx-files-into-r

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!