Reading Excel in R: how to find the start cell in messy spreadsheets

暗喜 2020-12-28 10:17

I'm trying to write R code to read data from a mess of old spreadsheets. The exact location of the data varies from sheet to sheet: the only constant is that the first co

7 answers
  •  攒了一身酷
    2020-12-28 10:20

    Here is how I would tackle it.

    STEP 1
    Read the Excel sheet in without headers, so that no row is consumed as column names.

    STEP 2
    Find the row index of the cell containing your target string ("Monthly return" in this case).

    STEP 3
    Filter from the identified row (or column or both), prettify a little and done.
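
    To make step 2 concrete, here is a minimal sketch on a made-up data frame that stands in for a sheet read with header = FALSE. The values and the names raw, hit, start_row and start_col are purely illustrative, not from the question.

    ```r
    # Toy stand-in for a sheet read without headers (values are invented)
    raw <- data.frame(
      X1 = c("Some title", NA, "Date", "2017-01-31", "2017-02-28"),
      X2 = c(NA, NA, "Monthly return", "0.012", "-0.004"),
      stringsAsFactors = FALSE
    )

    # raw == "..." gives a logical matrix; which(..., arr.ind = TRUE)
    # turns the TRUE cells into (row, col) pairs, skipping NA cells
    hit <- which(raw == "Monthly return", arr.ind = TRUE)
    start_row <- hit[1, "row"]
    start_col <- hit[1, "col"]

    # Step 3: everything below the header row is data
    body <- raw[(start_row + 1):nrow(raw), ]
    ```

    Because the match returns both a row and a column, the same lookup covers sheets where the table is shifted right as well as down.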

    Here is what a sample function looks like. It works for your example no matter where the data sits in the sheet. You can replace the exact string match with a regex to make it more robust.

    Function Definition:

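    library(xlsx)

    extract_return <- function(path = getwd(), filename = "Mysheet.xlsx", sheetnum = 1) {
      filepath <- file.path(path, filename)
      # Read the raw sheet; header = FALSE keeps every row as data
      input <- read.xlsx(filepath, sheetnum, header = FALSE)
      # Row/column position of every cell equal to "Monthly return"
      hit <- which(input == "Monthly return", arr.ind = TRUE)
      if (nrow(hit) == 0) stop("'Monthly return' not found in sheet ", sheetnum)
      start_idx <- hit[1, "row"]
      ret_col <- hit[1, "col"]
      # Assumption: the dates sit in the column immediately left of the header cell
      output <- input[start_idx:nrow(input), c(ret_col - 1, ret_col)]
      rownames(output) <- NULL
      colnames(output) <- c("Date", "Monthly Return")
      output[-1, ]  # drop the header row itself
    }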
    

    Example:

    final_df <- extract_return(
      path = "~/Desktop",
      filename = "Apr2017.xlsx",
      sheetnum = 2
    )
    

    No matter how many rows or columns you have, the idea remains the same. Give it a try and let me know.
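
    If the header text is not spelled identically in every sheet, the regex idea mentioned above might look roughly like this. It is only a sketch: find_header_cell and its default pattern are hypothetical, and the toy data frame is invented so the snippet runs without a file.

    ```r
    # Hypothetical helper: row/col of the first cell matching `pattern`
    find_header_cell <- function(
        df,
        pattern = "^[[:space:]]*monthly[[:space:]]+returns?[[:space:]]*$") {
      # Turn every column into character so grepl sees plain strings
      cells <- vapply(df, as.character, character(nrow(df)))
      # grepl drops the matrix shape, so rebuild it; NA cells become FALSE
      hits <- matrix(grepl(pattern, cells, ignore.case = TRUE), nrow = nrow(df))
      idx <- which(hits, arr.ind = TRUE)
      if (nrow(idx) == 0) stop("no cell matches pattern: ", pattern)
      idx[1, ]  # named vector with elements "row" and "col"
    }

    # Toy sheet with stray spaces and different capitalisation
    raw <- data.frame(
      X1 = c(NA, "Date", "2017-01-31"),
      X2 = c("Report", " monthly Return ", "0.012"),
      stringsAsFactors = FALSE
    )
    pos <- find_header_cell(raw)
    ```

    The returned position can then replace the exact-match lookup inside extract_return; how loose the pattern should be depends on how much the headers actually vary in your files.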
