How to add an index by set of data when using rbindlist?

Question

I have several different csv files with the same structure. I read them into R using fread, and then union them into a bigger dataset using rbindlist().

library(data.table)
# list.files() expects a regular expression, not a shell glob
files <- list.files(pattern = "\\.csv$")
x2csv <- rbindlist(lapply(files, fread, stringsAsFactors = FALSE), fill = TRUE)

The code works well. However, I would like to add a column filled with numbers indicating which csv file each observation came from. For example, the output should be:

       V1        V2         V3  C1
   1:   0 0.2859163 0.55848521   1
   2:   1 1.1616298 0.87571349   1 
   3:   2 2.1122510 0.95062116   2 
   4:   3 2.6832013 0.57095035   2
   5:   4 2.9117493 0.22854804   2 
   6:   5 2.9886040 0.07685464   3

where C1 is the new index column telling that: the first and second observations come from files[1] (the first .csv file); the third through fifth observations come from files[2] (the second .csv file); and so on.

Answer 1:

This is an enhanced version of Nicolás' answer which adds the file names instead of numbers:

x2csv <- rbindlist(lapply(files, fread), idcol = "origin")
x2csv[, origin := factor(origin, labels = basename(files))]

fread() uses stringsAsFactors = FALSE by default, so we can save some keystrokes.
Also, fill = TRUE is only required if we want to read files with differing structure, e.g., differing position, name, or number of columns.
The id column can be named (the default is .id) and is populated with the sequence number of the list element.
Then, this number is converted into a factor whose levels are labeled with the file names. A file name might be easier to remember than a mere number. basename() strips the path off the file name.
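
A possible variation on the same idea, sketched here under some assumptions: if the list passed to rbindlist() is named, the id column is filled with those names directly, so the factor() step is not needed. One potential advantage is that it does not rely on every file contributing at least one row. The temporary directory, the file names a.csv, b.csv, c.csv and their contents are made up purely for illustration.

```r
library(data.table)

# Write three small CSV files to a fresh temporary directory (illustrative data only)
tmp <- tempfile("idcol_demo")
dir.create(tmp)
fwrite(data.table(V1 = 0:1, V2 = c(0.29, 1.16)),       file.path(tmp, "a.csv"))
fwrite(data.table(V1 = 2:4, V2 = c(2.11, 2.68, 2.91)), file.path(tmp, "b.csv"))
fwrite(data.table(V1 = 5L,  V2 = 2.99),                file.path(tmp, "c.csv"))

files <- list.files(tmp, pattern = "\\.csv$", full.names = TRUE)

# Name each list element after its file; rbindlist() uses the names for the id column
tables <- setNames(lapply(files, fread), basename(files))
x2csv  <- rbindlist(tables, idcol = "origin")

# Number of rows contributed by each file
x2csv[, .N, by = origin]
```

Here origin is a character column; it can still be converted with factor() afterwards if a factor is preferred.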

Answer 2:

You are only missing the idcol argument from rbindlist(). Run:

x2csv <- rbindlist(lapply(files, fread, stringsAsFactors = FALSE), fill = TRUE, idcol = TRUE )
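
To see what idcol does without needing any files on disk, here is a small self-contained sketch. The three in-memory tables are stand-ins (an assumption for illustration) for what lapply(files, fread) would return, and the column name C1 is taken from the question. With idcol = TRUE the column is called .id; passing a string names it instead.

```r
library(data.table)

# Stand-ins for the tables that lapply(files, fread) would return
parts <- list(
  data.table(V1 = 0:1, V2 = c(0.29, 1.16)),
  data.table(V1 = 2:4, V2 = c(2.11, 2.68, 2.91)),
  data.table(V1 = 5L,  V2 = 2.99)
)

# The id column holds the position of the source table in the list
combined <- rbindlist(parts, fill = TRUE, idcol = "C1")

# rbindlist() puts the id column first; move it to the end to match the question's layout
setcolorder(combined, c(setdiff(names(combined), "C1"), "C1"))

combined$C1
# 1 1 2 2 2 3
```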

Source: https://stackoverflow.com/questions/49100250/add-file-name-to-appended-dataset-for-each-file-in-r

Tags

csv

data.table