I have a few hundred thousand very small .dat.gz files that I want to read into R as efficiently as possible. I read in each file and then immediately aggregate and discard the data.
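For concreteness, a minimal sketch of such a per-file read-and-aggregate loop, shelling out to an external gunzip for each file (the file listing, the use of pipe(), the Day grouping column, and the sum aggregation are all illustrative assumptions, not the actual processing):

dat.files <- list.files(pattern = "\\.dat\\.gz$")        # all compressed files in the working directory
results <- lapply(dat.files, function(f) {
  d <- read.delim(pipe(paste("gunzip -c", shQuote(f))))  # one external gunzip per file
  aggregate(. ~ Day, data = d, FUN = sum)                # placeholder aggregation step
})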
I'm sort of surprised that this actually worked. Hopefully it works for your case too. I'm quite curious to know how the speed compares to reading the compressed data from disk directly in R (albeit with a penalty for non-vectorization); a rough timing sketch follows the code below.
library(data.table)
# Read the header once to get the column names, then read every file
# through a single shell pipeline, dropping the repeated "Day" header lines.
tblNames = fread('cat *dat.gz | gunzip | head -n 1')[, colnames(.SD)]
tbl = fread('cat *dat.gz | gunzip | grep -v "^Day"')
setnames(tbl, tblNames)
tbl
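For the speed comparison mentioned above, a rough timing sketch of the shell-pipeline read versus per-file reads from R (it assumes the files sit in the working directory and share the same header; the function names are just for illustration):

library(data.table)
shell_pipe <- function() fread('cat *dat.gz | gunzip | grep -v "^Day"')
per_file   <- function() rbindlist(lapply(list.files(pattern = "\\.dat\\.gz$"),
                                          function(f) read.delim(gzfile(f))))
system.time(shell_pipe())
system.time(per_file())

The idea behind the pipeline is to amortise the per-file open and decompress overhead into one pass, which is presumably where any speedup would come from.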
R can read gzipped files natively via the gzfile connection function. See if this works:
library(data.table)
# dat.files is assumed to hold the paths to all the .dat.gz files,
# e.g. dat.files <- list.files(pattern = "\\.dat\\.gz$")
rbindlist(lapply(dat.files, function(f) {
  read.delim(gzfile(f))   # gzfile() decompresses transparently while reading
}))
The bottleneck might be caused by the use of a system() call to an external application.
You should try using the built-in functions to extract the archive instead. This answer explains how: Decompress gz file using R
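For example, a single archive can be decompressed with a base R connection rather than an external call (a minimal sketch; the file name is hypothetical):

f <- "example.dat.gz"            # hypothetical file name
con <- gzfile(f, open = "rt")    # text-mode gzip connection, no system() needed
dat <- read.delim(con)           # decompresses transparently while reading
close(con)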