Specifying Column Types when Importing xlsx Data to R with Package readxl

梦如初夏 2020-12-13 07:40

I'm importing xlsx 2007 tables into R 3.2.1 patched using package readxl 0.1.0 under Windows 7 64-bit. The tables' size is

6 answers
  •  暖寄归人
    2020-12-13 07:54

    The internal functions for guessing column types can be told to scan any number of rows, but read_excel() doesn't expose that (yet?).

    The solution below is just a rewrite of the original read_excel() function with an added argument n_max that defaults to all rows. For lack of imagination, the extended function is named read_excel2.

    Just replace read_excel with read_excel2 to evaluate column types by all rows.

    # Inspiration: https://github.com/hadley/readxl/blob/master/R/read_excel.R 
    # Rewrote read_excel() to read_excel2() with additional argument 'n_max' for number
    # of rows to evaluate in function readxl:::xls_col_types and
    # readxl:::xlsx_col_types()
    # This is probably an unstable solution, since it calls internal functions from readxl.
    # May or may not survive next update of readxl. Seems to work in version 0.1.0
    library(readxl)
    
    read_excel2 <- function(path, sheet = 1, col_names = TRUE, col_types = NULL,
                           na = "", skip = 0, n_max = 1050000L) {
    
      path <- readxl:::check_file(path)
      ext <- tolower(tools::file_ext(path))
    
      switch(readxl:::excel_format(path),
             xls =  read_xls2(path, sheet, col_names, col_types, na, skip, n_max),
             xlsx = read_xlsx2(path, sheet, col_names, col_types, na, skip, n_max)
      )
    }
    read_xls2 <- function(path, sheet = 1, col_names = TRUE, col_types = NULL,
                         na = "", skip = 0, n_max = 1050000L) {
    
      sheet <- readxl:::standardise_sheet(sheet, readxl:::xls_sheets(path))
    
      has_col_names <- isTRUE(col_names)
      if (has_col_names) {
        col_names <- readxl:::xls_col_names(path, sheet, nskip = skip)
      } else if (readxl:::isFALSE(col_names)) {
        col_names <- paste0("X", seq_along(readxl:::xls_col_names(path, sheet)))
      }
    
      if (is.null(col_types)) {
        col_types <- readxl:::xls_col_types(
          path, sheet, na = na, nskip = skip, has_col_names = has_col_names, n = n_max
        )
      }
    
      readxl:::xls_cols(path, sheet, col_names = col_names, col_types = col_types, 
                        na = na, nskip = skip + has_col_names)
    }
    
    read_xlsx2 <- function(path, sheet = 1L, col_names = TRUE, col_types = NULL,
                           na = "", skip = 0, n_max = 1050000L) {
      path <- readxl:::check_file(path)
      sheet <-
        readxl:::standardise_sheet(sheet, readxl:::xlsx_sheets(path))
    
      if (is.null(col_types)) {
        col_types <-
          readxl:::xlsx_col_types(
            path = path, sheet = sheet, na = na, nskip = skip + isTRUE(col_names), n = n_max
          )
      }
    
      readxl:::read_xlsx_(path, sheet, col_names = col_names, col_types = col_types, na = na,
                 nskip = skip)
    }
    

    You might get a nasty performance hit from this extended guessing. I haven't tried it on really big data sets yet, only on smaller data, enough to verify that the function works.
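
    If relying on readxl internals is a concern, two approaches that use only the exported read_excel() interface may be worth trying. The sketch below is not part of the workaround above, and it assumes a later readxl release (1.0.0 or newer, as far as I know) rather than the 0.1.0 discussed here: the guess_max argument and the readxl_example() helper are not available in 0.1.0. The bundled workbook datasets.xlsx and its "iris" sheet stand in for your own file. Supplying col_types explicitly avoids guessing altogether, at the cost of having to know the column layout in advance.

    ```r
    library(readxl)

    # Assumption: readxl >= 1.0.0, which ships this example workbook.
    path <- readxl_example("datasets.xlsx")

    # Option 1: let readxl guess, but base the guess on (up to) every row
    # of the sheet instead of the default first 1000 rows.
    guessed <- read_excel(path, sheet = "iris", guess_max = 1048576)

    # Option 2: skip guessing and declare one type per column.
    # The "iris" sheet has four numeric columns followed by one text column.
    typed <- read_excel(
      path,
      sheet = "iris",
      col_types = c("numeric", "numeric", "numeric", "numeric", "text")
    )

    # Inspect the resulting column classes.
    sapply(typed, class)
    ```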
