Question
Is there any way to combine multiple CSV files into one large file without using the read.csv/read_csv functions?
I want to combine all the tables (CSV files) in a folder into a single CSV file, since each of them represents a separate month. The folder looks like this:
list.files(folder)
[1] "2013-07 - Citi Bike trip data.csv" "2013-08 - Citi Bike trip data.csv" "2013-09 - Citi Bike trip data.csv"
[4] "2013-10 - Citi Bike trip data.csv" "2013-11 - Citi Bike trip data.csv" "2013-12 - Citi Bike trip data.csv"
[7] "2014-01 - Citi Bike trip data.csv" "2014-02 - Citi Bike trip data.csv" "2014-03 - Citi Bike trip data.csv"
[10] "2014-04 - Citi Bike trip data.csv" "2014-05 - Citi Bike trip data.csv" "2014-06 - Citi Bike trip data.csv"
[13] "2014-07 - Citi Bike trip data.csv" "2014-08 - Citi Bike trip data.csv" "201409-citibike-tripdata.csv"
[16] "201410-citibike-tripdata.csv" "201411-citibike-tripdata.csv" "201412-citibike-tripdata.csv"
[19] "201501-citibike-tripdata.csv" "201502-citibike-tripdata.csv" "201503-citibike-tripdata.csv"
[22] "201504-citibike-tripdata.csv" "201505-citibike-tripdata.csv" "201506-citibike-tripdata.csv"
[25] "201507-citibike-tripdata.csv" "201508-citibike-tripdata.csv" "201509-citibike-tripdata.csv"
[28] "201510-citibike-tripdata.csv" "201511-citibike-tripdata.csv" "201512-citibike-tripdata.csv"
[31] "201601-citibike-tripdata.csv" "201602-citibike-tripdata.csv" "201603-citibike-tripdata.csv"
I tried the following and did get all the data, as a large list of 33 elements taking up 3.6 GB. However, the full process took a while. Since the website is updated monthly, the growing data size will make the merge slower and slower. Could someone help me combine all the data files without loading them into the environment? The data source can be found here: https://s3.amazonaws.com/tripdata/index.html.
library(readr)
filenames <- list.files(folder, full.names = TRUE)
data <- lapply(filenames, read_csv)
The result looks like this, which is not the form I want: it is a list with one data frame per file. I would like to have one big table with all the information merged together.
> head(data)
[[1]]
Source: local data frame [843,416 x 15]
tripduration starttime stoptime start station id start station name start station latitude
(int) (time) (time) (int) (chr) (dbl)
1 634 2013-07-01 00:00:00 2013-07-01 00:10:34 164 E 47 St & 2 Ave 40.75323
2 1547 2013-07-01 00:00:02 2013-07-01 00:25:49 388 W 26 St & 10 Ave 40.74972
3 178 2013-07-01 00:01:04 2013-07-01 00:04:02 293 Lafayette St & E 8 St 40.73029
4 1580 2013-07-01 00:01:06 2013-07-01 00:27:26 531 Forsyth St & Broome St 40.71894
5 757 2013-07-01 00:01:10 2013-07-01 00:13:47 382 University Pl & E 14 St 40.73493
6 861 2013-07-01 00:01:23 2013-07-01 00:15:44 511 E 14 St & Avenue B 40.72939
7 550 2013-07-01 00:01:59 2013-07-01 00:11:09 293 Lafayette St & E 8 St 40.73029
8 288 2013-07-01 00:02:16 2013-07-01 00:07:04 224 Spruce St & Nassau St 40.71146
9 766 2013-07-01 00:02:16 2013-07-01 00:15:02 432 E 7 St & Avenue A 40.72622
10 773 2013-07-01 00:02:23 2013-07-01 00:15:16 173 Broadway & W 49 St 40.76065
.. ... ... ... ... ... ...
Variables not shown: start station longitude (dbl), end station id (int), end station name (chr), end station latitude (dbl), end
station longitude (dbl), bikeid (int), usertype (chr), birth year (chr), gender (int)
[[2]]
Source: local data frame [1,001,958 x 15]
tripduration starttime stoptime start station id start station name start station latitude
(int) (time) (time) (int) (chr) (dbl)
1 664 2013-08-01 00:00:00 2013-08-01 00:11:04 449 W 52 St & 9 Ave 40.76462
2 2115 2013-08-01 00:00:01 2013-08-01 00:35:16 254 W 11 St & 6 Ave 40.73532
3 385 2013-08-01 00:00:03 2013-08-01 00:06:28 460 S 4 St & Wythe Ave 40.71286
4 653 2013-08-01 00:00:10 2013-08-01 00:11:03 398 Atlantic Ave & Furman St 40.69165
5 954 2013-08-01 00:00:11 2013-08-01 00:16:05 319 Park Pl & Church St 40.71336
6 145 2013-08-01 00:00:37 2013-08-01 00:03:02 521 8 Ave & W 31 St 40.75045
7 331 2013-08-01 00:01:25 2013-08-01 00:06:56 2000 Front St & Washington St 40.70255
8 194 2013-08-01 00:01:26 2013-08-01 00:04:40 313 Washington Ave & Park Ave 40.69610
9 598 2013-08-01 00:01:40 2013-08-01 00:11:38 528 2 Ave & E 31 St 40.74291
10 360 2013-08-01 00:01:45 2013-08-01 00:07:45 500 Broadway & W 51 St 40.76229
.. ... ... ... ... ... ...
Variables not shown: start station longitude (dbl), end station id (int), end station name (chr), end station latitude (dbl), end
station longitude (dbl), bikeid (int), usertype (chr), birth year (chr), gender (int)
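Answer 1:
You don't need to load each CSV into R. Combine the CSVs outside of R and then load the combined file in one go. Here's a shell one-liner that will do the job if you have access to Unix commands (solution found elsewhere). It keeps the header of the first file and skips the header line of every later file: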
nawk 'FNR==1 && NR!=1{next;}{print}' *.csv > master.csv
Or, using the Windows command prompt (saved as a batch file; solution found elsewhere):
@echo off
setlocal
set first=1
>master.csv.tmp (
  for %%F in (*.csv) do (
    if defined first (
      type "%%F"
      set "first="
    ) else more +1 "%%F"
  )
)
move /y master.csv.tmp master.csv >nul
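To see what the awk one-liner above does, here is a minimal sketch on two tiny demo files. The file names and values are made up for illustration, and it calls awk rather than nawk on the assumption that nawk is not installed everywhere; the program itself is the same. The output is written outside the input folder so that a re-run does not pick up the previous master.csv through the *.csv pattern.

```shell
# Demo data: two tiny CSVs with the same header (hypothetical names and values).
workdir=$(mktemp -d)
mkdir "$workdir/in"
printf 'tripduration,bikeid\n634,16950\n1547,19816\n' > "$workdir/in/2013-07.csv"
printf 'tripduration,bikeid\n664,17168\n' > "$workdir/in/2013-08.csv"

# FNR is the line number within the current file; NR counts lines across all files.
# "FNR==1 && NR!=1" is true for the first line (the header) of every file except
# the first one, and `next` skips it. Every other line is printed unchanged.
awk 'FNR==1 && NR!=1 {next} {print}' "$workdir"/in/*.csv > "$workdir/master.csv"

cat "$workdir/master.csv"
```

The merged file should contain a single header line followed by the data rows of both files, in the order the shell expands the glob.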
Answer 2:
You have a list of data frames. If you want to stack those data frames into one big data frame, use:
dplyr::bind_rows(data)
Alternatively, you can concatenate the CSVs themselves outside of R using cat (as suggested above), and you can call that from within R like this:
setwd(folder)
system("cat *.csv > full.csv")
The only problem is that the column headers will be duplicated for each of the files that you concatenated, which you might not want.
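One way around the duplicated headers, sketched here under a few assumptions: every file has the same header in the same column order, and every file ends with a newline. The loop copies the first file whole and appends the later files from line 2 onward. The demo file names and values are hypothetical.

```shell
# Demo data: two tiny CSVs with identical headers (hypothetical names and values).
demo=$(mktemp -d)
mkdir "$demo/in"
printf 'a,b\n1,2\n' > "$demo/in/one.csv"
printf 'a,b\n3,4\n5,6\n' > "$demo/in/two.csv"

out="$demo/full.csv"   # written outside the input folder so *.csv never matches it
first=1
for f in "$demo"/in/*.csv; do
  if [ "$first" -eq 1 ]; then
    cat "$f" > "$out"            # first file: keep the header line
    first=0
  else
    tail -n +2 "$f" >> "$out"    # later files: start at line 2, dropping the header
  fi
done

cat "$out"
```

Saved as a script, a loop like this could be run from R through system() in the same way as the cat command above.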
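Answer 3:
You can use the Windows command prompt (CMD). Change to the directory that contains the CSV files and run:
C:\yourdirWhereCsvfilesExist> copy *.csv combinedfile.csv
You will then have a single file called combinedfile.csv with all the data. As with cat, the files are joined as they are, so each file's header line is repeated in the result.
I hope this helps!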
Answer 4:
I would use this one:
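library(data.table)

# Read every file in `path` with fread() and stack the results by rows.
multmerge <- function(path) {
  filenames <- list.files(path = path, full.names = TRUE)
  rbindlist(lapply(filenames, fread))
}

path <- "C:/Users/kkk/Desktop/test/test1"
mergeA <- multmerge(path)
# row.names = FALSE avoids writing an extra column of row numbers.
write.csv(mergeA, "mergeA.csv", row.names = FALSE)
That solution was posted in a different thread as a way to merge multiple files. Note that it still reads every file into R before writing the combined file.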
Source: https://stackoverflow.com/questions/37444120/how-to-combine-multiple-csv-files-into-one-big-file-without-loading-the-actual-f