Reading Excel in R: how to find the start cell in messy spreadsheets

暗喜 2020-12-28 10:17

I'm trying to write R code to read data from a mess of old spreadsheets. The exact location of the data varies from sheet to sheet: the only constant is that the first co

7 answers

攒了一身酷 (original poster)

2020-12-28 10:20

Here is how I would tackle it.

STEP 1
Read the Excel sheet in without headers, so the header row arrives as ordinary data.

STEP 2
Find the row index of the marker string, "Monthly return" in this case.

STEP 3
Keep everything from the identified row down (or filter by column, or both), tidy up the result, and you are done.

Here is what a sample function looks like. It works for your example no matter where it is in the spreadsheet. You can play around with regex to make it more robust.

Function Definition:

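library(xlsx)

# Read one sheet and return the two-column table that starts at "Monthly return".
extract_return <- function(path = getwd(), filename = "Mysheet.xlsx", sheetnum = 1) {
  filepath <- file.path(path, filename)
  # Read without headers so the header row is ordinary data we can search.
  input <- read.xlsx(filepath, sheetnum, header = FALSE)
  # Row and column of the first cell equal to "Monthly return".
  pos <- which(input == "Monthly return", arr.ind = TRUE)
  if (nrow(pos) == 0) stop("'Monthly return' not found in ", filepath)
  start_row <- pos[1, "row"]
  start_col <- pos[1, "col"]
  # Assumption: the dates sit in the column immediately left of the header cell.
  output <- input[(start_row + 1):nrow(input), c(start_col - 1, start_col)]
  rownames(output) <- NULL
  colnames(output) <- c("Date", "Monthly Return")
  output
}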

Example:

final_df <- extract_return(
                path = "~/Desktop", 
                filename = "Apr2017.xlsx", 
                sheetnum = 2)

No matter how many rows or columns you have, the idea remains the same. Give it a try and let me know.
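The regex idea mentioned above could look something like the sketch below. It is only an illustration: find_header_cell and its default pattern are hypothetical names and choices, not part of the xlsx package, and the sample data frame stands in for what a header-less read might return. It tolerates stray whitespace, different capitalisation, and "return" versus "returns".

```r
# Hypothetical helper: return the row/column of the first cell (in
# column-major order) whose text matches `pattern`, or NULL if none does.
find_header_cell <- function(df, pattern = "^\\s*monthly\\s+returns?\\s*$") {
  cells <- as.matrix(df)  # one matrix of cell values; NA cells stay NA
  # grepl() drops dimensions, so restore them before asking for indices.
  matched <- array(grepl(pattern, cells, ignore.case = TRUE), dim = dim(cells))
  hits <- which(matched, arr.ind = TRUE)
  if (nrow(hits) == 0) return(NULL)
  hits[1, ]
}

# Made-up stand-in for a sheet read with header = FALSE.
raw <- data.frame(
  X1 = c("Fund report", NA, NA, "2017-01-31", "2017-02-28"),
  X2 = c(NA, NA, " monthly  Return ", "0.012", "-0.004"),
  stringsAsFactors = FALSE
)

pos <- find_header_cell(raw)
# Assumption: dates are in the column immediately left of the header cell.
data_rows <- raw[(pos[["row"]] + 1):nrow(raw), c(pos[["col"]] - 1, pos[["col"]])]
colnames(data_rows) <- c("Date", "Monthly Return")
```

Returning NULL when nothing matches lets the caller decide whether a sheet without the marker is an error or should simply be skipped.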
