What's wrong with my function to load multiple .csv files into single dataframe in R using rbind?

Submitted by 青春壹個敷衍的年華 on 2019-11-26 03:56:37

Question


I have written the following function to combine 300 .csv files. My directory is named "specdata". I ran the following steps:

# Step 1: define the function
x <- function(directory) {
    dir <- directory
    data_dir <- paste(getwd(), dir, sep = "/")
    files  <- list.files(data_dir, pattern = '\\.csv')
    tables <- lapply(paste(data_dir, files, sep = "/"), read.csv, header = TRUE)
    pollutantmean <- do.call(rbind, tables)
}

# Step 2: call the function
x("specdata")

# Step 3: inspect results
head(pollutantmean)

Error in head(pollutantmean) : object 'pollutantmean' not found

What is my mistake? Can anyone please explain?


Answer 1:


There's a lot of unnecessary code in your function. You can simplify it to:

load_data <- function(path) { 
  files <- dir(path, pattern = '\\.csv', full.names = TRUE)
  tables <- lapply(files, read.csv)
  do.call(rbind, tables)
}

pollutantmean <- load_data("specdata")

Be aware that do.call + rbind is relatively slow. You might find dplyr::bind_rows or data.table::rbindlist to be substantially faster.
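A minimal sketch of the faster route mentioned above, assuming data.table may or may not be installed. The name load_data_fast and the demo files are hypothetical; the function falls back to the base R version when the package is missing, so it runs either way.

```r
# Hypothetical variant of load_data(): uses data.table when available
load_data_fast <- function(path) {
  files <- list.files(path, pattern = "\\.csv$", full.names = TRUE)
  if (requireNamespace("data.table", quietly = TRUE)) {
    # fread() reads each file; rbindlist() stacks the whole list in one pass
    tables <- lapply(files, data.table::fread)
    as.data.frame(data.table::rbindlist(tables))
  } else {
    # base R fallback, same result but slower on many files
    do.call(rbind, lapply(files, read.csv))
  }
}

# Demo on two throwaway files in a fresh temporary directory
demo_dir <- tempfile("specdata_fast_")
dir.create(demo_dir)
write.csv(data.frame(ID = 1, sulfate = c(1.5, 2.5)),
          file.path(demo_dir, "001.csv"), row.names = FALSE)
write.csv(data.frame(ID = 2, sulfate = 3.5),
          file.path(demo_dir, "002.csv"), row.names = FALSE)

combined <- load_data_fast(demo_dir)
nrow(combined)
```

With only a handful of files the difference is negligible; it tends to matter once there are hundreds of files, as in the question.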




Answer 2:


To update Prof. Wickham's answer above with code from the more recent purrr library which he coauthored with Lionel Henry:

library(tidyverse)  # attaches purrr (map_df), readr (read_csv, cols) and the %>% pipe

Tbl <-
    list.files(pattern = "\\.csv$") %>% 
    map_df(~read_csv(.))

If type guessing gives the same column different types in different files, you can force every column to be read as character:

Tbl <-
    list.files(pattern = "\\.csv$") %>% 
    map_df(~read_csv(., col_types = cols(.default = "c")))

To pick up files from a subdirectory, pass its path to list.files and set full.names = TRUE so that each entry carries its full path. That lets read_csv find the files even though they are outside the current working directory. (Think of the full path names as passports that allow movement across directory 'borders'.)

Tbl <-
    list.files(path = "./subdirectory/",
               pattern = "\\.csv$", 
               full.names = TRUE) %>% 
    map_df(~read_csv(., col_types = cols(.default = "c"))) 

As Prof. Wickham describes:

map_df(x, f) is effectively the same as do.call("rbind", lapply(x, f)) but under the hood is much more efficient.

Thanks also to Jake Kaupp for introducing me to map_df().
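One detail worth checking in any of these snippets is the pattern argument: list.files() treats it as a regular expression, not as a shell glob. The sketch below uses only base R and a made-up vector of file names to show how a loose pattern, an anchored regex, and a glob translated with glob2rx() behave.

```r
# Hypothetical file names, chosen to expose the difference
files <- c("001.csv", "notes_csv.txt", "002.csv", "summary.csv.bak")

# Loose: "." matches any character and the match may occur anywhere in the name
loose <- files[grepl(".csv", files)]

# Strict: escaped dot plus "$" anchor keeps only names that end in ".csv"
strict <- files[grepl("\\.csv$", files)]

# glob2rx() (utils package, attached by default) translates a glob into a regex
from_glob <- files[grepl(glob2rx("*.csv"), files)]

print(loose)
print(strict)
print(from_glob)
```

The anchored form is the safer default when a directory may also contain backups or text files whose names merely include "csv".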




Answer 3:


This can be done very succinctly with dplyr and purrr from the tidyverse. If x is a character vector of the paths to your csv files, you can simply use:

bind_rows(map(x, read.csv))

Mapping read.csv over x produces a list of data frames, which bind_rows then combines into one.
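A short sketch of how x might be built and what bind_rows does when the files do not share all columns, assuming dplyr and purrr are installed. The directory, file names, and column names are invented for illustration.

```r
library(dplyr)  # bind_rows()
library(purrr)  # map()

# Two throwaway files whose columns differ (hypothetical data)
demo_dir <- tempfile("csv_demo_")
dir.create(demo_dir)
write.csv(data.frame(ID = 1, sulfate = 2.5),
          file.path(demo_dir, "001.csv"), row.names = FALSE)
write.csv(data.frame(ID = 2, nitrate = 0.7),
          file.path(demo_dir, "002.csv"), row.names = FALSE)

# x holds full paths, so read.csv() finds the files from any working directory
x <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# bind_rows() matches columns by name and fills the gaps with NA
combined <- bind_rows(map(x, read.csv))
print(combined)
```

Base rbind would stop with an error on these two files because their columns differ, which is one practical reason to prefer bind_rows for loosely structured inputs.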




Answer 4:


```{r echo = FALSE, warning = FALSE, message = FALSE}

path <- "~/Data/R/BacklogReporting/data/PastDue/global/" ## where the files are located

## full.names = TRUE returns usable paths, so no setwd() is needed
file.names <- dir(path, pattern = "\\.csv$", full.names = TRUE)

out.file <- NULL ## start empty: rbind(NULL, df) returns df unchanged
for (i in seq_along(file.names)) {
  file <- read.csv(file.names[i], header = TRUE, stringsAsFactors = FALSE)
  out.file <- rbind(out.file, file)
}

write.csv(out.file, file = "~/Data/R/BacklogReporting/data/PastDue/global/global_stacked/past_due_global_stacked.csv", row.names = FALSE) ## directory to write stacked file to

past_due_global_stacked <- read.csv("C:/Users/E550143/Documents/Data/R/BacklogReporting/data/PastDue/global/global_stacked/past_due_global_stacked.csv", stringsAsFactors = FALSE)

## comma-separated list of the input file names
files <- paste(list.files(path, pattern = "\\.csv$"), collapse = ", ")
```



Answer 5:


If your csv files are in another directory, you could use something like this:

library(data.table)  # provides fread() and rbindlist()

readFilesInDirectory <- function(directory, pattern) {
  # full.names = TRUE prepends the directory, so no manual paste() is needed
  files <- list.files(path = directory, pattern = pattern, full.names = TRUE)
  temp <- lapply(files, fread, sep = ",")
  rbindlist(temp)  # returned as the function's value
}



Answer 6:


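In your current function, pollutantmean exists only inside the scope of the function x, so it is not found once the call has returned. One fix is to assign into the global environment explicitly:

x <- function(directory) {
    dir <- directory
    data_dir <- paste(getwd(), dir, sep = "/")
    files  <- list.files(data_dir, pattern = '\\.csv')
    tables <- lapply(paste(data_dir, files, sep = "/"), read.csv, header = TRUE)
    # envir = .GlobalEnv is required; without it assign() writes to the function's own environment
    assign('pollutantmean', do.call(rbind, tables), envir = .GlobalEnv)
}

With envir = .GlobalEnv, assign puts the result of do.call(rbind, tables) into a variable called pollutantmean in the global environment. Returning the value and assigning it at the call site, as in the first answer, is generally cleaner than writing globals from inside a function.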
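A minimal base R sketch of the scoping rule described in this answer, using throwaway files in a temporary directory. The name combine_csv and the demo data are hypothetical; the point is only that a variable created inside a function disappears when the call ends, while the returned value can be captured by the caller.

```r
# Throwaway data so the sketch runs anywhere (hypothetical file names)
demo_dir <- tempfile("specdata_scope_")
dir.create(demo_dir)
write.csv(data.frame(ID = 1, sulfate = c(1.5, 2.5)),
          file.path(demo_dir, "001.csv"), row.names = FALSE)
write.csv(data.frame(ID = 2, sulfate = 3.5),
          file.path(demo_dir, "002.csv"), row.names = FALSE)

combine_csv <- function(directory) {
  files <- list.files(directory, pattern = "\\.csv$", full.names = TRUE)
  pollutantmean <- do.call(rbind, lapply(files, read.csv))  # local to this call
  pollutantmean                                              # returned to the caller
}

# The caller decides where the result lives by assigning the return value
result <- combine_csv(demo_dir)
head(result)
```

The same idea rescues the function from the question without editing it: an assignment as the last expression still returns its value (invisibly), so pollutantmean <- x("specdata") captures the combined data frame.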



Source: https://stackoverflow.com/questions/23190280/whats-wrong-with-my-function-to-load-multiple-csv-files-into-single-dataframe
