Read an Excel file directly from an R script

误落风尘 2020-11-22 13:52

How can I read an Excel file directly into R? Or should I first export the data to a text- or CSV file and import that file into R?

12 Answers
  •  甜味超标
    2020-11-22 14:21

    Given the proliferation of different ways to read an Excel file in R and the plethora of answers here, I thought I'd try to shed some light on which of the options mentioned here perform the best (in a few simple situations).

    I myself have been using xlsx since I started using R, for inertia if nothing else, and I recently noticed there doesn't seem to be any objective information about which package works better.

    Any benchmarking exercise is fraught with difficulties as some packages are sure to handle certain situations better than others, and a waterfall of other caveats.

    That said, I'm using a (reproducible) data set that I think is in a pretty common format (8 string fields, 3 numeric, 1 integer, 3 dates):

    NN <- 1000L  # row count; the benchmarks below use NN = 1000L and NN = 25000L
    set.seed(51423)
    data.frame(
      str1 = sample(sprintf("%010d", 1:NN)), #ID field 1
      str2 = sample(sprintf("%09d", 1:NN)),  #ID field 2
      #varying length string field--think names/addresses, etc.
      str3 = 
        replicate(NN, paste0(sample(LETTERS, sample(10:30, 1L), TRUE),
                             collapse = "")),
      #factor-like string field with 50 "levels"
      str4 = sprintf("%05d", sample(sample(1e5, 50L), NN, TRUE)),
      #factor-like string field with 17 levels, varying length
      str5 = 
        sample(replicate(17L, paste0(sample(LETTERS, sample(15:25, 1L), TRUE),
                                     collapse = "")), NN, TRUE),
      #lognormally distributed numeric
      num1 = round(exp(rnorm(NN, mean = 6.5, sd = 1.5)), 2L),
      #3 binary strings
      str6 = sample(c("Y","N"), NN, TRUE),
      str7 = sample(c("M","F"), NN, TRUE),
      str8 = sample(c("B","W"), NN, TRUE),
      #right-skewed integer
      int1 = ceiling(rexp(NN)),
      #dates by month
      dat1 = 
        sample(seq(from = as.Date("2005-12-31"), 
                   to = as.Date("2015-12-31"), by = "month"),
               NN, TRUE),
      dat2 = 
        sample(seq(from = as.Date("2005-12-31"), 
                   to = as.Date("2015-12-31"), by = "month"),
               NN, TRUE),
      num2 = round(exp(rnorm(NN, mean = 6, sd = 1.5)), 2L),
      #date by day
      dat3 = 
        sample(seq(from = as.Date("2015-06-01"), 
                   to = as.Date("2015-07-15"), by = "day"),
               NN, TRUE),
      #lognormal numeric that can be positive or negative
      num3 = 
        (-1) ^ sample(2, NN, TRUE) * round(exp(rnorm(NN, mean = 6, sd = 1.5)), 2L)
    )
    

    I then wrote this to CSV, opened it in LibreOffice, and saved it as an .xlsx file, then benchmarked 4 of the packages mentioned in this thread: xlsx, openxlsx, readxl, and gdata, using the default options (I also tried a version where I specified column types, but this didn't change the rankings).
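
    For reference, the export step might look like this sketch (the variable and file names here are mine, not from the original post; it assumes the data.frame call above was assigned to a variable, say DF):

    # write the generated data to CSV...
    write.csv(DF, "test_data.csv", row.names = FALSE)
    # ...then open test_data.csv in LibreOffice Calc and save it manually as an
    # .xlsx file; fl is the path used by all the benchmarks below
    fl <- "test_data.xlsx"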

    I'm excluding RODBC because I'm on Linux; XLConnect because it seems its primary purpose is not reading in single Excel sheets but importing entire Excel workbooks, so to put its horse in the race on only its reading capabilities seems unfair; and xlsReadWrite because it is no longer compatible with my version of R (seems to have been phased out).

    I then ran the benchmarks with NN = 1000L and NN = 25000L (resetting the seed before each declaration of the data.frame above) to allow for differences with respect to Excel file size. The gc calls are primarily for xlsx, which I've found can at times create memory clogs. Without further ado, here are the results I found:

    1,000-Row Excel File

    library(microbenchmark)  # timing functions used throughout

    benchmark1k <-
      microbenchmark(times = 100L,
                     xlsx = {xlsx::read.xlsx2(fl, sheetIndex=1); invisible(gc())},
                     openxlsx = {openxlsx::read.xlsx(fl); invisible(gc())},
                     readxl = {readxl::read_excel(fl); invisible(gc())},
                     gdata = {gdata::read.xls(fl); invisible(gc())})
    
    # Unit: milliseconds
    #      expr       min        lq      mean    median        uq       max neval
    #      xlsx  194.1958  199.2662  214.1512  201.9063  212.7563  354.0327   100
    #  openxlsx  142.2074  142.9028  151.9127  143.7239  148.0940  255.0124   100
    #    readxl  122.0238  122.8448  132.4021  123.6964  130.2881  214.5138   100
    #     gdata 2004.4745 2042.0732 2087.8724 2062.5259 2116.7795 2425.6345   100
    

    So readxl is the winner, with openxlsx competitive and gdata a clear loser. Taking each measure relative to the column minimum:

    #       expr   min    lq  mean median    uq   max
    # 1     xlsx  1.59  1.62  1.62   1.63  1.63  1.65
    # 2 openxlsx  1.17  1.16  1.15   1.16  1.14  1.19
    # 3   readxl  1.00  1.00  1.00   1.00  1.00  1.00
    # 4    gdata 16.43 16.62 15.77  16.67 16.25 11.31
    

    We see that my own favorite, xlsx, is 60% slower than readxl.
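
    For anyone who wants to reproduce that relative table, one way to compute it from the microbenchmark summary is sketched below; the original answer doesn't show this step, so treat the object names as mine:

    smry <- summary(benchmark1k)
    cols <- c("min", "lq", "mean", "median", "uq", "max")
    rel <- smry[cols]
    # divide each timing column by its own minimum, so the fastest package reads 1.00
    rel[] <- lapply(rel, function(x) round(x / min(x), 2))
    cbind(expr = smry$expr, rel)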

    25,000-Row Excel File

    Due to the amount of time it takes, I only did 20 repetitions on the larger file; otherwise the commands were identical (see the sketch below).
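
    As a sketch of what that call looks like (the original answer doesn't print it, so the object name benchmark25k is mine), the only change from the 1,000-row run is times = 20L:

    # after regenerating and re-saving the file with NN = 25000L
    benchmark25k <-
      microbenchmark(times = 20L,
                     xlsx = {xlsx::read.xlsx2(fl, sheetIndex=1); invisible(gc())},
                     openxlsx = {openxlsx::read.xlsx(fl); invisible(gc())},
                     readxl = {readxl::read_excel(fl); invisible(gc())},
                     gdata = {gdata::read.xls(fl); invisible(gc())})

    Here's the raw data: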

    # Unit: milliseconds
    #      expr        min         lq       mean     median         uq        max neval
    #      xlsx  4451.9553  4539.4599  4738.6366  4762.1768  4941.2331  5091.0057    20
    #  openxlsx   962.1579   981.0613   988.5006   986.1091   992.6017  1040.4158    20
    #    readxl   341.0006   344.8904   347.0779   346.4518   348.9273   360.1808    20
    #     gdata 43860.4013 44375.6340 44848.7797 44991.2208 45251.4441 45652.0826    20
    

    Here's the relative data:

    #       expr    min     lq   mean median     uq    max
    # 1     xlsx  13.06  13.16  13.65  13.75  14.16  14.13
    # 2 openxlsx   2.82   2.84   2.85   2.85   2.84   2.89
    # 3   readxl   1.00   1.00   1.00   1.00   1.00   1.00
    # 4    gdata 128.62 128.67 129.22 129.86 129.69 126.75
    

    So readxl is the clear winner when it comes to speed. gdata better have something else going for it, as it's painfully slow in reading Excel files, and this problem is only exacerbated for larger tables.

    Two points in openxlsx's favor are 1) its extensive other methods for working with Excel files (readxl is designed to do only one thing, which is probably part of why it's so fast), especially its write.xlsx function, and 2) (more of a drawback for readxl) that the col_types argument in readxl (as of this writing) only accepts some nonstandard R type names: "text" instead of "character" and "date" instead of "Date".
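
    To make both points concrete, here is a small sketch against the 15-column benchmark file (DF and the file names are the assumed ones from above, not anything prescribed by the packages):

    # openxlsx can also write .xlsx files, something readxl does not do at all:
    openxlsx::write.xlsx(DF, "test_data_copy.xlsx")

    # readxl's col_types wants "text" and "date" rather than "character" and "Date";
    # one entry per column (8 strings, 3 numerics, 1 integer, 3 dates; the integer
    # column is read as "numeric" since readxl has no integer type):
    readxl::read_excel(fl,
                       col_types = c(rep("text", 5), "numeric", rep("text", 3),
                                     "numeric", "date", "date", "numeric",
                                     "date", "numeric"))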
