how to write tryCatch() function when extracting data from multiple files?

问题

I am currently working on thousands of files (with .MOD extension) where I want to extract specific information from all these files. These information will then be collected into one excel sheet in such a way that each row represents information extracted from one .MOD file. I have managed to do this.

However, there are lets say about 10-20 files (out of the tens of thousands) that do not contain information in the format that I want, and this therefore throws an error. I cannot of course manually keep digging into all the files, or cannot subset them each time to find which of these files is throwing the error. Therefore, I want to include a tryCatch() function, so that the script still continues to run without stopping. For the files that give error, I simply want the values to be replaced by "Error" in those specific cells. Can anyone help me how to do that?

Following is how I want my final excel output to look like:

ID  COL1    COL2    COL3    COL4    COL5    COL6    COL7    COL8
Sample1 9-5-2014    10:42:41    600 1207    3   2   62  30
Sample2 8-1-2013    08:44:50    654 1873    1   7   60  45
Sample3 2-3-2013    14:47:40    767 1645    1   18  66  37
Sample4 8-2-2013    08:50:45    727 1500    1   8   68  45
Sample5 4-1-2013    13:08:49    Error   Error   Error   Error   Error   Error
Sample6 1-2-2013    13:08:47    720 1433    1   16  60  51
Sample7 3-4-2013    13:59:04    610 1343    2   13  66  32

Following is my code (along with the error):

AR.MOD.files <- list.files(pattern = "AR.MOD|ar.MOD")
    for (fileName in AR.MOD.files) {
    AR.MOD <- read.table(fileName, header = FALSE, fill = TRUE)
    AR.MOD.subset1 <- AR.MOD[c(1), 3:4]
    names(AR.MOD.subset1) <- c("COL1", "COL2")
    AR.MOD.subset2 <- AR.MOD[c(3), 3:8]
    names(AR.MOD.subset2) <- c("COL3", "COL4", "COL5", "COL6", "COL7", "COL8")
    AR.MOD.final <- merge(AR.MOD.subset1, AR.MOD.subset2)
    ID <- basename(fileName)
    AR.MOD.final <- merge (ID, AR.MOD.final)
    colnames(AR.MOD.final)[colnames(AR.MOD.final)=="x"] <- "ID"
    if(match(fileName,AR.MOD.files)==1){
            output.AR.MOD <- AR.MOD.final
        }else{
            output.AR.MOD <- rbind(output.AR.MOD,AR.MOD.final)}
        }
Error in `[.data.frame`(AR.MOD, c(3), 3:8) : undefined columns selected
    output.AR.MOD$ID <- gsub("AR.MOD", "", paste(output.AR.MOD$ID))
    output.AR.MOD$ID <- gsub("ar.MOD", "", paste(output.AR.MOD$ID))
    print(output.AR.MOD)

I here share 2 example files:

> AR.MOD <- read.table("Sample1ar.MOD", header = FALSE, fill = TRUE)
> AR.MOD
    V1 V2        V3       V4   V5    V6   V7    V8
1 Case  1 23-3-2013 14:47:40                      
2  Run NA                                         
3    R  1    767,96  1647,72 1,78 18,88 0,66 37,33

> AR.MOD <- read.table("Sample2AR.MOD", header = FALSE, fill = TRUE)
> AR.MOD
    V1 V2       V3       V4   V5   V6   V7    V8
1 Case  1 9-5-2014 10:42:41                     
2  Run NA                                       
3    R  1   566,47  1207,22 3,05 2,95 0,62 30,00

It works with the above 2 examples. However, if one of the column is missing, lets say in the following, then it throws error.

> AR.MOD <- read.table("Sample3AR.MOD", header = FALSE, fill = TRUE)
> AR.MOD
    V1 V2        V3      V4   V5   V6   V7
1 Case  1 28-1-2013 8:44:50                     
2  Run NA                                       
3    R  1    783,76 1873,70 1,34 7,48 0,60

I am at this point not sure which file it is coming from, but I here send you a dummy example in the 3rd sample from above. I am not able to attach files directly here, that is why I read it and send you as an output.

回答1:

Here's an approach.

writeLines("a,b\n1,2", "Letin_good.csv")
writeLines("", "Letin_bad1.csv")
writeLines("c,d\n3,4", "Letin_bad2.csv")

# myfiles <- list.files(pattern = "Letin.*\\.csv", full.names = TRUE)
myfiles <- c("Letin_good.csv", "Letin_good.csv", "Letin_bad1.csv", "Letin_good.csv", "Letin_bad2.csv")

datlist <- lapply(myfiles, function(fn) {
  tryCatch({
    out <- read.csv(fn, header=TRUE, stringsAsFactors=FALSE)
    # do something with the data
    out
  },
    error = function(e) NULL)
})
str(datlist)
# List of 5
#  $ :'data.frame': 1 obs. of  2 variables:
#   ..$ a: int 1
#   ..$ b: int 2
#  $ :'data.frame': 1 obs. of  2 variables:
#   ..$ a: int 1
#   ..$ b: int 2
#  $ : NULL
#  $ :'data.frame': 1 obs. of  2 variables:
#   ..$ a: int 1
#   ..$ b: int 2
#  $ :'data.frame': 1 obs. of  2 variables:
#   ..$ c: int 3
#   ..$ d: int 4

At this point, the third element is clearly wrong (read.csv failed) and the fifth element is incorrect (wrong headers). We can generate a filter of sorts that returns TRUE if all "conditions" are met (e.g., all required names present):

gooddatlist <- Filter(function(x) {
  all(
    c("a", "b") %in% names(x)
    # other tests
  )
}, datlist)
str(gooddatlist)
# List of 3
#  $ :'data.frame': 1 obs. of  2 variables:
#   ..$ a: num 12
#   ..$ b: int 2
#  $ :'data.frame': 1 obs. of  2 variables:
#   ..$ a: num 12
#   ..$ b: int 2
#  $ :'data.frame': 1 obs. of  2 variables:
#   ..$ a: num 12
#   ..$ b: int 2

alldat <- do.call(rbind, gooddatlist)

回答2:

I'd echo the lapply solution to make the tables in individual list elements and then handle the combination afterwards. Here is an example using the data.table package that fills the data with NA's where it can't find it:

# # for installing:
# install.packages(data.table)
library(data.table)

# generate tables with uneven columns
set.seed(1)
tables <- lapply(1:10, function(i){
  ncols <- sample(1:5, 1, 1)
  out <- as.data.frame(matrix(runif(ncols), nrow=1, ncol=ncols))
})

# you can use rbindlist with fill=TRUE to fill the bad values with NA
output <- as.data.frame(rbindlist(tables, fill=TRUE))

EDIT: I can't be certain this will work off the bat, but give it a try:

# # for installing:
# install.packages(data.table)
library(data.table)

# Set this to what you expect max to be 
ncol_total <- 9
tables <- lapply(AR.MOD.files, function(fileName){
  AR.MOD <- read.table(fileName, header = FALSE, fill = TRUE)
  AR.MOD.subset1 <- AR.MOD[c(1), 3:4]
  names(AR.MOD.subset1) <- c("COL1", "COL2")
  AR.MOD.subset2 <- AR.MOD[c(3), 3:8]
  names(AR.MOD.subset2) <- c("COL3", "COL4", "COL5", "COL6", "COL7", "COL8")
  AR.MOD.final <- merge(AR.MOD.subset1, AR.MOD.subset2)
  ID <- basename(fileName)
  AR.MOD.final <- merge (ID, AR.MOD.final)
  colnames(AR.MOD.final)[colnames(AR.MOD.final)=="x"] <- "ID"

  # add in missing data
  ncol_file <- ncol(AR.MOD.final)
  missing <- ncol_total - ncol_file
  if(missing > 0){
    new_data <- as.data.frame(matrix("Error", nrow=nrow(AR.MOD.final), ncol=missing))
    AR.MOD.final <- cbind(AR.MOD.final, AR.MOD.final)
  }

  AR.MOD.final
})

# this will likely screw up the column names. Its better to know what these
# are and assign after, as long as the tables are all in the same order
output <- as.data.frame(rbindlist(tables, use.names = FALSE))
names(output) <- c("ID", "COL1", "COL2", "COL3", "COL4", "COL5", "COL6", "COL7"
                   "COL8")

# continuing on
output$ID <- gsub("AR.MOD", "", paste(output$ID))
output$ID <- gsub("ar.MOD", "", paste(output$ID))
print(output)

来源：https://stackoverflow.com/questions/57824494/how-to-write-trycatch-function-when-extracting-data-from-multiple-files

标签

try-catch