How to combine multiple csv files into one big file without loading the actual file into the environment?

问题

Is there anyway to combine multiple CSV files together into a super file without using the read.csv/read_csv functions?

I want to combine all the tables (CSV) in the folder into one csv file, since each of them represents a separate month. The folder looks like this:

list.files(folder)

 [1] "2013-07 - Citi Bike trip data.csv" "2013-08 - Citi Bike trip data.csv" "2013-09 - Citi Bike trip data.csv"
 [4] "2013-10 - Citi Bike trip data.csv" "2013-11 - Citi Bike trip data.csv" "2013-12 - Citi Bike trip data.csv"
 [7] "2014-01 - Citi Bike trip data.csv" "2014-02 - Citi Bike trip data.csv" "2014-03 - Citi Bike trip data.csv"
[10] "2014-04 - Citi Bike trip data.csv" "2014-05 - Citi Bike trip data.csv" "2014-06 - Citi Bike trip data.csv"
[13] "2014-07 - Citi Bike trip data.csv" "2014-08 - Citi Bike trip data.csv" "201409-citibike-tripdata.csv"     
[16] "201410-citibike-tripdata.csv"      "201411-citibike-tripdata.csv"      "201412-citibike-tripdata.csv"     
[19] "201501-citibike-tripdata.csv"      "201502-citibike-tripdata.csv"      "201503-citibike-tripdata.csv"     
[22] "201504-citibike-tripdata.csv"      "201505-citibike-tripdata.csv"      "201506-citibike-tripdata.csv"     
[25] "201507-citibike-tripdata.csv"      "201508-citibike-tripdata.csv"      "201509-citibike-tripdata.csv"     
[28] "201510-citibike-tripdata.csv"      "201511-citibike-tripdata.csv"      "201512-citibike-tripdata.csv"     
[31] "201601-citibike-tripdata.csv"      "201602-citibike-tripdata.csv"      "201603-citibike-tripdata.csv"

I tried the following and did get the big data, which is a large list of 33 elements and 3.6 Gbs. However, the full process took a while. Considering the fact that the website is updated monthly, the increasing data size will make the merging process more slowly. Thus, could someone help me combine all the data files together without loading them into the environment? The data source could be found here: https://s3.amazonaws.com/tripdata/index.html.

filenames<- list.files(folder, full.names =TRUE)
data<- lapply(filenames,read_csv)

The data file looks like this, which is not the form I want. I would like to have a big table with all the information merged together.

> head(data)
[[1]]
Source: local data frame [843,416 x 15]

   tripduration           starttime            stoptime start station id      start station name start station latitude
          (int)              (time)              (time)            (int)                   (chr)                  (dbl)
1           634 2013-07-01 00:00:00 2013-07-01 00:10:34              164         E 47 St & 2 Ave               40.75323
2          1547 2013-07-01 00:00:02 2013-07-01 00:25:49              388        W 26 St & 10 Ave               40.74972
3           178 2013-07-01 00:01:04 2013-07-01 00:04:02              293   Lafayette St & E 8 St               40.73029
4          1580 2013-07-01 00:01:06 2013-07-01 00:27:26              531  Forsyth St & Broome St               40.71894
5           757 2013-07-01 00:01:10 2013-07-01 00:13:47              382 University Pl & E 14 St               40.73493
6           861 2013-07-01 00:01:23 2013-07-01 00:15:44              511      E 14 St & Avenue B               40.72939
7           550 2013-07-01 00:01:59 2013-07-01 00:11:09              293   Lafayette St & E 8 St               40.73029
8           288 2013-07-01 00:02:16 2013-07-01 00:07:04              224   Spruce St & Nassau St               40.71146
9           766 2013-07-01 00:02:16 2013-07-01 00:15:02              432       E 7 St & Avenue A               40.72622
10          773 2013-07-01 00:02:23 2013-07-01 00:15:16              173      Broadway & W 49 St               40.76065
..          ...                 ...                 ...              ...                     ...                    ...
Variables not shown: start station longitude (dbl), end station id (int), end station name (chr), end station latitude (dbl), end
  station longitude (dbl), bikeid (int), usertype (chr), birth year (chr), gender (int)

[[2]]
Source: local data frame [1,001,958 x 15]

   tripduration           starttime            stoptime start station id        start station name start station latitude
          (int)              (time)              (time)            (int)                     (chr)                  (dbl)
1           664 2013-08-01 00:00:00 2013-08-01 00:11:04              449           W 52 St & 9 Ave               40.76462
2          2115 2013-08-01 00:00:01 2013-08-01 00:35:16              254           W 11 St & 6 Ave               40.73532
3           385 2013-08-01 00:00:03 2013-08-01 00:06:28              460        S 4 St & Wythe Ave               40.71286
4           653 2013-08-01 00:00:10 2013-08-01 00:11:03              398  Atlantic Ave & Furman St               40.69165
5           954 2013-08-01 00:00:11 2013-08-01 00:16:05              319       Park Pl & Church St               40.71336
6           145 2013-08-01 00:00:37 2013-08-01 00:03:02              521           8 Ave & W 31 St               40.75045
7           331 2013-08-01 00:01:25 2013-08-01 00:06:56             2000  Front St & Washington St               40.70255
8           194 2013-08-01 00:01:26 2013-08-01 00:04:40              313 Washington Ave & Park Ave               40.69610
9           598 2013-08-01 00:01:40 2013-08-01 00:11:38              528           2 Ave & E 31 St               40.74291
10          360 2013-08-01 00:01:45 2013-08-01 00:07:45              500        Broadway & W 51 St               40.76229
..          ...                 ...                 ...              ...                       ...                    ...
Variables not shown: start station longitude (dbl), end station id (int), end station name (chr), end station latitude (dbl), end
  station longitude (dbl), bikeid (int), usertype (chr), birth year (chr), gender (int)

回答1:

You don't need to load each csv into R. Combine the csvs outside of R and then load the files all at once. Here's a shell script that'll do the job if you have access to unix commands (solution from here).

nawk 'FNR==1 && NR!=1{next;}{print}' *.csv > master.csv

Or using windows command prompt (solution from here):

@echo off
setlocal
set first=1
>master.csv.tmp (
  for %%F in (*.csv) do (
    if defined first (
      type "%%F"
      set "first="
    ) else more +1 "%%F"
  )
)
move /y master.csv.tmp master.csv >nul

回答2:

You have a list of data frames. So if you want to melt those data frames into one big data frame, then do:

dplyr::bind_rows(data)

On the other hand, you can concatenate the CSVs themselves outside of R using cat (as suggested above). But you can call that from within R like this:

setwd(folder)
system("cat *.csv > full.csv")

The only problem is that the column headers will be duplicated for each of the files that you concatenated, which you might not want.

回答3:

You can use the CMD and simply write :

C:\yourdirWhereCsvfilesExist\copy *.csv combinedfile.csv

then you will have a single file called combinedfile.csv with all the data

I hope that's will be helpful for you!

回答4:

I would use this one:

library(data.table)
multmerge = function(path){
  filenames=list.files(path=path, full.names=TRUE)
  rbindlist(lapply(filenames, fread))
} 
path <- "C:/Users/kkk/Desktop/test/test1"
mergeA <- multmerge(path)
write.csv(mergeA, "mergeA.csv")

That solution was posted under a different thread as a way to merge multiple files

来源：https://stackoverflow.com/questions/37444120/how-to-combine-multiple-csv-files-into-one-big-file-without-loading-the-actual-f

标签

csv

merge

bigdata