Question
I am currently working with thousands of files (with the .MOD extension) and want to extract specific information from each of them. This information is then collected into one Excel sheet so that each row holds the information extracted from one .MOD file. I have managed to do this.
However, about 10-20 of these files (out of the tens of thousands) do not contain the information in the format I expect, and this throws an error. I cannot manually dig through all the files, or subset them repeatedly, to find out which files cause the error. I therefore want to use tryCatch() so that the script keeps running instead of stopping. For the files that raise an error, I simply want the values in the affected cells to be replaced by "Error". Can anyone help me do that?
This is how I want my final Excel output to look:
ID COL1 COL2 COL3 COL4 COL5 COL6 COL7 COL8
Sample1 9-5-2014 10:42:41 600 1207 3 2 62 30
Sample2 8-1-2013 08:44:50 654 1873 1 7 60 45
Sample3 2-3-2013 14:47:40 767 1645 1 18 66 37
Sample4 8-2-2013 08:50:45 727 1500 1 8 68 45
Sample5 4-1-2013 13:08:49 Error Error Error Error Error Error
Sample6 1-2-2013 13:08:47 720 1433 1 16 60 51
Sample7 3-4-2013 13:59:04 610 1343 2 13 66 32
Following is my code (along with the error):
AR.MOD.files <- list.files(pattern = "AR.MOD|ar.MOD")
for (fileName in AR.MOD.files) {
AR.MOD <- read.table(fileName, header = FALSE, fill = TRUE)
AR.MOD.subset1 <- AR.MOD[c(1), 3:4]
names(AR.MOD.subset1) <- c("COL1", "COL2")
AR.MOD.subset2 <- AR.MOD[c(3), 3:8]
names(AR.MOD.subset2) <- c("COL3", "COL4", "COL5", "COL6", "COL7", "COL8")
AR.MOD.final <- merge(AR.MOD.subset1, AR.MOD.subset2)
ID <- basename(fileName)
AR.MOD.final <- merge (ID, AR.MOD.final)
colnames(AR.MOD.final)[colnames(AR.MOD.final)=="x"] <- "ID"
if(match(fileName,AR.MOD.files)==1){
output.AR.MOD <- AR.MOD.final
}else{
output.AR.MOD <- rbind(output.AR.MOD,AR.MOD.final)}
}
# Error in `[.data.frame`(AR.MOD, c(3), 3:8) : undefined columns selected
output.AR.MOD$ID <- gsub("AR.MOD", "", paste(output.AR.MOD$ID))
output.AR.MOD$ID <- gsub("ar.MOD", "", paste(output.AR.MOD$ID))
print(output.AR.MOD)
Here are 2 example files:
> AR.MOD <- read.table("Sample1ar.MOD", header = FALSE, fill = TRUE)
> AR.MOD
V1 V2 V3 V4 V5 V6 V7 V8
1 Case 1 23-3-2013 14:47:40
2 Run NA
3 R 1 767,96 1647,72 1,78 18,88 0,66 37,33
> AR.MOD <- read.table("Sample2AR.MOD", header = FALSE, fill = TRUE)
> AR.MOD
V1 V2 V3 V4 V5 V6 V7 V8
1 Case 1 9-5-2014 10:42:41
2 Run NA
3 R 1 566,47 1207,22 3,05 2,95 0,62 30,00
It works with the above 2 examples. However, if one of the columns is missing, as in the following file, it throws the error.
> AR.MOD <- read.table("Sample3AR.MOD", header = FALSE, fill = TRUE)
> AR.MOD
V1 V2 V3 V4 V5 V6 V7
1 Case 1 28-1-2013 8:44:50
2 Run NA
3 R 1 783,76 1873,70 1,34 7,48 0,60
At this point I am not sure which file the error comes from, so the 3rd sample above is a dummy example. I cannot attach files directly here, which is why I read each file and show its printed output instead.
Answer 1:
Here's an approach.
writeLines("a,b\n1,2", "Letin_good.csv")
writeLines("", "Letin_bad1.csv")
writeLines("c,d\n3,4", "Letin_bad2.csv")
# myfiles <- list.files(pattern = "Letin.*\\.csv", full.names = TRUE)
myfiles <- c("Letin_good.csv", "Letin_good.csv", "Letin_bad1.csv", "Letin_good.csv", "Letin_bad2.csv")
datlist <- lapply(myfiles, function(fn) {
tryCatch({
out <- read.csv(fn, header=TRUE, stringsAsFactors=FALSE)
# do something with the data
out
},
error = function(e) NULL)
})
str(datlist)
# List of 5
# $ :'data.frame': 1 obs. of 2 variables:
# ..$ a: int 1
# ..$ b: int 2
# $ :'data.frame': 1 obs. of 2 variables:
# ..$ a: int 1
# ..$ b: int 2
# $ : NULL
# $ :'data.frame': 1 obs. of 2 variables:
# ..$ a: int 1
# ..$ b: int 2
# $ :'data.frame': 1 obs. of 2 variables:
# ..$ c: int 3
# ..$ d: int 4
At this point, the third element is clearly wrong (read.csv failed, so the error handler returned NULL) and the fifth element is incorrect (wrong headers). We can write a filter that returns TRUE only if all "conditions" are met (e.g., all required names are present):
gooddatlist <- Filter(function(x) {
all(
c("a", "b") %in% names(x)
# other tests
)
}, datlist)
str(gooddatlist)
# List of 3
# $ :'data.frame': 1 obs. of 2 variables:
# ..$ a: int 1
# ..$ b: int 2
# $ :'data.frame': 1 obs. of 2 variables:
# ..$ a: int 1
# ..$ b: int 2
# $ :'data.frame': 1 obs. of 2 variables:
# ..$ a: int 1
# ..$ b: int 2
alldat <- do.call(rbind, gooddatlist)
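The filter above drops the failed files entirely. Since the question asks for the failed files to stay in the output with "Error" in their cells, a variation on the same tryCatch pattern is to return a placeholder row from the error handler. The following is only a sketch under the assumptions of the toy example above: the helper name read_or_error and the extra file column are hypothetical, and the required columns are assumed to be a and b.

```r
# Toy files in a temporary directory, mirroring the example above
tmp <- tempfile("letin_")
dir.create(tmp)
good <- file.path(tmp, "Letin_good.csv")
bad  <- file.path(tmp, "Letin_bad1.csv")
writeLines("a,b\n1,2", good)
writeLines("", bad)

# Hypothetical helper: returns one row per file, with "Error" on failure
read_or_error <- function(fn) {
  tryCatch({
    out <- read.csv(fn, header = TRUE, stringsAsFactors = FALSE)
    # Raise an error on an unexpected layout so the handler covers it too
    if (!all(c("a", "b") %in% names(out))) stop("missing required columns")
    # Store values as character so they can share a column with "Error"
    data.frame(file = basename(fn),
               a = as.character(out$a),
               b = as.character(out$b),
               stringsAsFactors = FALSE)
  },
  error = function(e) {
    data.frame(file = basename(fn), a = "Error", b = "Error",
               stringsAsFactors = FALSE)
  })
}

result <- do.call(rbind, lapply(c(good, bad, good), read_or_error))
print(result)
```

Because the placeholder has the same columns as a successful row, rbind works without a separate filtering step, and the file column shows which inputs failed.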
Answer 2:
I'd echo the lapply solution: build the tables as individual list elements and handle the combination afterwards. Here is an example using the data.table package that fills the data with NAs where a column is missing:
# # for installing:
# install.packages("data.table")
library(data.table)
# generate tables with uneven columns
set.seed(1)
tables <- lapply(1:10, function(i){
ncols <- sample(1:5, 1, 1)
out <- as.data.frame(matrix(runif(ncols), nrow=1, ncol=ncols))
})
# you can use rbindlist with fill=TRUE to fill the bad values with NA
output <- as.data.frame(rbindlist(tables, fill=TRUE))
EDIT: I can't be certain this will work off the bat, but give it a try:
# # for installing:
# install.packages("data.table")
library(data.table)
# Set this to what you expect max to be
ncol_total <- 9
tables <- lapply(AR.MOD.files, function(fileName){
AR.MOD <- read.table(fileName, header = FALSE, fill = TRUE)
AR.MOD.subset1 <- AR.MOD[c(1), 3:4]
names(AR.MOD.subset1) <- c("COL1", "COL2")
AR.MOD.subset2 <- AR.MOD[c(3), 3:8]
names(AR.MOD.subset2) <- c("COL3", "COL4", "COL5", "COL6", "COL7", "COL8")
AR.MOD.final <- merge(AR.MOD.subset1, AR.MOD.subset2)
ID <- basename(fileName)
AR.MOD.final <- merge (ID, AR.MOD.final)
colnames(AR.MOD.final)[colnames(AR.MOD.final)=="x"] <- "ID"
# add in missing data
ncol_file <- ncol(AR.MOD.final)
missing <- ncol_total - ncol_file
if(missing > 0){
new_data <- as.data.frame(matrix("Error", nrow=nrow(AR.MOD.final), ncol=missing))
AR.MOD.final <- cbind(AR.MOD.final, new_data)
}
AR.MOD.final
})
# this will likely mess up the column names. It's better to know what these
# are and assign them afterwards, as long as the tables are all in the same order
output <- as.data.frame(rbindlist(tables, use.names = FALSE))
names(output) <- c("ID", "COL1", "COL2", "COL3", "COL4", "COL5", "COL6", "COL7",
"COL8")
# continuing on
output$ID <- gsub("AR.MOD", "", paste(output$ID))
output$ID <- gsub("ar.MOD", "", paste(output$ID))
print(output)
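Note that, according to the error message in the question, it is the column selection AR.MOD[c(3), 3:8] itself that fails when a column is missing, so any padding placed after that line is never reached for the problem files. One possible way around this is to wrap only that selection in tryCatch and substitute "Error" there. This is a minimal sketch, not a tested solution for the real data: the helper name extract_row and the two dummy files are assumptions built from the samples printed in the question.

```r
# Dummy files modelled on the samples printed in the question
tmp <- tempfile("mod_")
dir.create(tmp)
writeLines(c("Case 1 23-3-2013 14:47:40", "Run",
             "R 1 767,96 1647,72 1,78 18,88 0,66 37,33"),
           file.path(tmp, "Sample1ar.MOD"))
writeLines(c("Case 1 28-1-2013 8:44:50", "Run",
             "R 1 783,76 1873,70 1,34 7,48 0,60"),   # one value missing
           file.path(tmp, "Sample3AR.MOD"))

# Hypothetical helper: always returns one 9-column row per file
extract_row <- function(fileName) {
  AR.MOD <- read.table(fileName, header = FALSE, fill = TRUE,
                       stringsAsFactors = FALSE)
  # Strip the "ar.MOD" / "AR.MOD" suffix to get the sample ID
  id <- sub("ar\\.MOD$", "", basename(fileName), ignore.case = TRUE)
  head_vals <- as.character(unlist(AR.MOD[1, 3:4]))
  # Only this selection is guarded; it fails when columns are missing
  vals <- tryCatch(
    as.character(unlist(AR.MOD[3, 3:8])),
    error = function(e) rep("Error", 6)
  )
  row <- data.frame(t(c(id, head_vals, vals)), stringsAsFactors = FALSE)
  names(row) <- c("ID", paste0("COL", 1:8))
  row
}

files <- list.files(tmp, pattern = "ar\\.MOD$", ignore.case = TRUE,
                    full.names = TRUE)
output <- do.call(rbind, lapply(files, extract_row))
print(output)
```

Keeping the tryCatch narrow means COL1 and COL2 are still filled in for the problem files, which matches the desired output table in the question. A file that cannot be read at all would still stop the run; wrapping the whole function body in a second tryCatch would be one way to cover that case.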
Source: https://stackoverflow.com/questions/57824494/how-to-write-trycatch-function-when-extracting-data-from-multiple-files