R, rbind with multiple files defined by a variable

Submitted by 南笙酒味 on 2019-12-02 16:29:59

Question


First off, this is related to a homework question for the Coursera R Programming course. I have found other ways to do what I want, but my research has led me to a question I'm curious about. I have a variable number of CSV files that I need to pull data from, and I then need to take the mean of the "pollutant" column across those files. The files are named by id number in their directory. I put together the following code, which works fine for a single CSV file but not for multiple CSV files:

pollutantmean <- function (directory, pollutant, id = 1:332) {
  id <- formatC(id, width=3, flag="0")
  dataset<-read.csv(paste(directory, "/", id,".csv",sep=""),header=TRUE)
  mean(dataset[,pollutant], na.rm = TRUE)
}

I also know how to rbind multiple CSV files together if I know the ids when I am writing the function, but I am not sure how to apply rbind to a variable range of ids, or whether that's even possible. I found other ways to do it, such as calling lapply and then unlisting the data; I'm just curious whether there is an easier way.


Answer 1:


Well, this uses an lapply, but it might be what you want.

file_list <- list.files("*your directory*", full.names = TRUE)

combined_data <- do.call(rbind, lapply(file_list, read.csv, header = TRUE))

This will turn all of your files into one large dataset, and from there it's easy to take the mean. Is that what you wanted?
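To restrict the combined data to a variable range of ids, as the question asks, the same do.call(rbind, ...) idea can be wrapped in a function. This is only a minimal sketch: it assumes the files are named 001.csv, 002.csv, and so on inside `directory`, the signature is taken from the question, and `paths` and `combined` are names introduced here for illustration.

```r
pollutantmean <- function(directory, pollutant, id = 1:332) {
  # Build one path per id, zero-padded to three digits (e.g. ".../001.csv")
  paths <- file.path(directory, sprintf("%03d.csv", id))
  # Read each file into a data frame, then stack the data frames row-wise
  combined <- do.call(rbind, lapply(paths, read.csv, header = TRUE))
  # pollutant is the column name as a string; missing values are dropped
  mean(combined[[pollutant]], na.rm = TRUE)
}

# Hypothetical call: pollutantmean("your-directory", "pollutant", id = 1:10)
```

Because `id` is only used to build the vector of paths, any subset of ids works without changing the function body.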

An alternative way of doing this would be to step through file by file, taking sums and number of observations and then taking the mean afterwards, like so:

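# Assumes pollutant holds the column name as a string
sums <- numeric(length(file_list))
n <- numeric(length(file_list))
for (i in seq_along(file_list)) {
  temp_df <- read.csv(file_list[i], header = TRUE)
  values <- temp_df[[pollutant]]
  sums[i] <- sum(values, na.rm = TRUE)
  n[i] <- sum(!is.na(values))  # count only non-missing observations
}
new_mean <- sum(sums) / sum(n)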

Note that both of these methods require that only your desired csvs are in that folder. You can use a pattern argument in the list.files call if you have other files in there that you're not interested in.
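As a rough illustration of the pattern argument, the sketch below builds a throwaway directory (`data_dir` is a hypothetical stand-in for your real folder) containing two CSV files and one unrelated file, then lists only the CSVs. The pattern is a regular expression, so the dot is escaped and `$` anchors the match to the end of the file name.

```r
# Hypothetical stand-in directory with two CSV files and one unrelated file
data_dir <- tempfile("pollution_")
dir.create(data_dir)
file.create(file.path(data_dir, c("001.csv", "002.csv", "notes.txt")))

# Keep only files whose names end in ".csv"
file_list <- list.files(data_dir, pattern = "\\.csv$", full.names = TRUE)
basename(file_list)
# [1] "001.csv" "002.csv"
```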




Answer 2:


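read.csv(file, ...) expects a single file path for 'file', not a vector of paths, so the call in your function fails as soon as id has more than one element.

Below is a slight modification of your code. A vector of file paths is created and sapply loops over it, returning one mean per file.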

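files <- paste("directory-name/", formatC(1:332, width = 3, flag = "0"),
               ".csv", sep = "")
pollutantmean <- function(file, pollutant) {
    dataset <- read.csv(file, header = TRUE)
    mean(dataset[, pollutant], na.rm = TRUE)
}
# Extra arguments to sapply are forwarded to pollutantmean;
# "pollutant" stands in for the actual column name.
sapply(files, pollutantmean, pollutant = "pollutant")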
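Looping over files this way gives one mean per file. If a single mean over all observations is wanted, averaging the per-file means only matches it when every file has the same number of non-missing values. A possible sketch of pooling the values first, along the lines of the lapply-then-unlist approach mentioned in the question (`pooled_mean` and `values` are names made up for this example):

```r
pooled_mean <- function(files, pollutant) {
  # Extract the requested column from each file as a plain vector
  values <- lapply(files, function(f) read.csv(f, header = TRUE)[[pollutant]])
  # Flatten the list of vectors and average over all observations at once
  mean(unlist(values), na.rm = TRUE)
}
```

For two files holding the values 1, 2, 3 and 10, this returns 4, whereas the mean of the two per-file means (2 and 10) would be 6.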


Source: https://stackoverflow.com/questions/29738637/r-rbind-with-multiple-files-defined-by-a-variable
