Fast reading and combining several files using data.table (with fread)

Posted by 好久不见 on 2020-04-05 07:32:07

Question


I have several different txt files with the same structure. Now I want to read them into R using fread, and then union them into a bigger dataset.

## First collect all file names in a character vector
library(data.table)
# pattern is a regular expression; full.names = TRUE returns paths that fread can open from any working directory
all.files <- list.files(path = "C:/Users", pattern = "\\.txt$", full.names = TRUE)

## Read data using fread
readdata <- function(fn){
    dt_temp <- fread(fn, sep=",")
    keycols <- c("ID", "date")
    setkeyv(dt_temp,keycols)  # setkeyv() takes the key columns as a character vector
    return(dt_temp)
}
# Then read every file and stack the results
mylist <- lapply(all.files, readdata)
mydata <- do.call('rbind',mylist)

The code works fine, but the speed is not satisfactory. Each txt file has 1M observations and 12 fields.

If I use fread to read a single file, it is fast. But with lapply the speed is extremely slow, and it obviously takes much more time than reading the files one by one. I wonder what went wrong here; is there any way to improve the speed?

I tried llply from the plyr package, but there was not much of a speed gain.

Also, is there any syntax in data.table for a vertical join, like rbind or UNION in SQL?

Thanks.


Answer 1:


Use rbindlist(), which is designed to rbind a list of data.tables together:

mylist <- lapply(all.files, readdata)
mydata <- rbindlist( mylist )
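
To relate rbindlist() to rbind and to SQL's UNION, here is a minimal sketch on two made-up tables (the names `a`, `b`, `stacked`, `deduped`, `tagged` and all values are hypothetical, not from the question). rbindlist() keeps duplicate rows, which is comparable to UNION ALL; wrapping the result in unique() gives behaviour closer to UNION.

```r
library(data.table)

# Two small, made-up tables with the same columns (illustrative data only)
a <- data.table(ID = c(1L, 2L), date = c("2014-01-01", "2014-01-02"))
b <- data.table(ID = c(2L, 3L), date = c("2014-01-02", "2014-01-03"))

# Stack the tables; duplicate rows are kept (comparable to UNION ALL)
stacked <- rbindlist(list(a, b))

# Drop duplicate rows to mimic SQL UNION
deduped <- unique(stacked)

# Optional: match columns by name and record which list element each row came from
tagged <- rbindlist(list(first = a, second = b), use.names = TRUE, idcol = "source")
```

The `idcol` argument can be handy when combining files, since it lets each row carry a label for the list element it came from.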

And as @Roland says, do not set the key in each iteration of your function!

So in summary, this is best:

l <- lapply(all.files, fread, sep=",")
dt <- rbindlist( l )
setkey( dt , ID, date )
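
For anyone who wants to try the pattern without the original files, the following is a self-contained sketch that writes three tiny comma-separated .txt files to a temporary folder and then reads, combines, and keys them once. The folder name `fread_demo`, the file names, the `value` column, and the `demo_*` variables are assumptions made up for this illustration.

```r
library(data.table)

# Create a scratch folder with three small comma-separated .txt files (made-up data)
tmp_dir <- file.path(tempdir(), "fread_demo")
dir.create(tmp_dir, showWarnings = FALSE)
for (i in 1:3) {
  fwrite(
    data.table(ID = c(2L, 1L), date = c("2014-01-02", "2014-01-01"), value = i),
    file.path(tmp_dir, paste0("part", i, ".txt"))
  )
}

# Full paths, so fread() finds the files regardless of the working directory
demo_files <- list.files(tmp_dir, pattern = "\\.txt$", full.names = TRUE)

# Read each file, combine once, then set the key once on the combined table
demo_list <- lapply(demo_files, fread, sep = ",")
demo_dt <- rbindlist(demo_list)
setkey(demo_dt, ID, date)
```

Setting the key a single time on the combined table sorts the data once, instead of sorting every file separately and then losing that work when the pieces are stacked.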


Source: https://stackoverflow.com/questions/21156271/fast-reading-and-combining-several-files-using-data-table-with-fread
