I'm attempting to merge multiple CSV files using R. All of the CSV files have the same fields and are all in a shared folder containing only these CSV files. I've attempted
For a shorter, faster solution:
library(dplyr)
library(readr)
df <- list.files(path = "yourpath", full.names = TRUE) %>%
  lapply(read_csv) %>%
  bind_rows()
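For anyone who wants to try the same read-and-stack pattern without installing any packages, here is a minimal base-R sketch. The folder `demo_dir` and the files `one.csv` and `two.csv` are made up for the demonstration; substitute your own folder. It assumes every file has the same columns.

```r
# Create a throwaway folder with two small CSV files (hypothetical demo data).
demo_dir <- file.path(tempdir(), "csv_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(A = 1:2, B = c("x", "y")),
          file.path(demo_dir, "one.csv"), row.names = FALSE)
write.csv(data.frame(A = 3:4, B = c("z", "w")),
          file.path(demo_dir, "two.csv"), row.names = FALSE)

# List only the CSV files, read each one, and stack the rows.
files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)
combined <- do.call(rbind, lapply(files, read.csv))
print(combined)
```

The `pattern` argument is optional when the folder holds nothing but CSV files, but it guards against stray files being read by accident.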
I tried working with the same function but included all = TRUE in the merge call, and it worked just fine.
The code I used is as follows:
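multmerge <- function(mypath) {
  # Full paths of every file in the folder
  filenames <- list.files(path = mypath, full.names = TRUE)
  # Read each file into a data frame
  datalist <- lapply(filenames, function(x) read.csv(file = x, header = TRUE))
  # Full outer join of all data frames on their shared columns
  Reduce(function(x, y) merge(x, y, all = TRUE), datalist)
}
full_data <- multmerge("path_name for your csv folder")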
Hope this helps. Cheers!
Another option that has proved to work for my setup:
library(data.table)  # provides fread() and rbindlist()
multmerge <- function(path) {
  filenames <- list.files(path = path, full.names = TRUE)
  rbindlist(lapply(filenames, fread))
}
path <- "Dropbox/rstudio-share/dataset/MB"
DF <- multmerge(path)
If you need more granular control over how each CSV file is loaded, you can replace fread with a function like so:
multmerge <- function(path) {
  filenames <- list.files(path = path, full.names = TRUE)
  rbindlist(lapply(filenames, function(x) read.csv(x, stringsAsFactors = FALSE, sep = ";")))
}
Let me give you the best I have ever had:
library(pacman)
p_load(doParallel, data.table, dplyr, stringr, fst)
# get the CSV file names in the working directory
dir() %>% str_subset("\\.csv$") -> fn
# set up a parallel cluster using all available cores
(cl = detectCores() %>%
  makeCluster()) %>%
  registerDoParallel()
# read and bind
system.time({
  big_df = foreach(i = fn,
                   .packages = "data.table") %dopar% {
    fread(i, colClasses = "character")
  } %>%
    rbindlist(fill = TRUE)
})
# end of parallel work: shut down the cluster created above
stopCluster(cl)
This should be faster as long as your computer has multiple cores. If you are dealing with big data, it is the preferred approach.
If all your CSV files have exactly the same fields (column names) and you simply want to combine them vertically, you should use rbind instead of merge:
> a
             A         B
[1,]  2.471202 38.949232
[2,] 16.935362  6.343694
> b
            A          B
[1,] 0.704630  0.1132538
[2,] 4.477572 11.8869057
> rbind(a, b)
             A          B
[1,]  2.471202 38.9492316
[2,] 16.935362  6.3436939
[3,]  0.704630  0.1132538
[4,]  4.477572 11.8869057
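To make the difference concrete, here is a small sketch with two made-up data frames (`x` and `y` are illustrative, not taken from the question). With no `by` argument, `merge` joins on every shared column, so a row that appears in both inputs shows up only once in the result, whereas `rbind` keeps every row.

```r
# Two hypothetical data frames that share the row (2, "b").
x <- data.frame(A = c(1, 2), B = c("a", "b"))
y <- data.frame(A = c(2, 3), B = c("b", "c"))

stacked <- rbind(x, y)              # keeps every row, in input order
merged  <- merge(x, y, all = TRUE)  # full outer join on the shared columns A and B

nrow(stacked)  # 4
nrow(merged)   # 3: the shared row (2, "b") appears only once
```

So if the files can contain identical rows that should all be kept, stacking is probably the safer choice.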
Your code worked for me, but you need to change header = True to header = TRUE. R is case-sensitive, so True is not recognized as a logical constant.