Fastest way to read in 100,000 .dat.gz files

耶瑟儿~ 2020-12-05 11:45

I have a few hundred thousand very small .dat.gz files that I want to read into R as efficiently as possible. I read in each file and then immediately aggregate and discard the data, so I'm not worried about memory; I just want to speed up the unzipping and reading step.
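
For context, here is a minimal sketch of the kind of per-file read-and-aggregate loop described above (a rough illustration only: the Value column and the sum() aggregation are assumptions, and read.delim is assumed to match the file format):

    dat.files <- list.files(pattern = "\\.dat\\.gz$", full.names = TRUE)
    agg <- NULL
    for (f in dat.files) {
        d <- read.delim(gzfile(f))                        # decompress and parse one small file
        s <- aggregate(Value ~ Day, data = d, FUN = sum)  # aggregate immediately ...
        agg <- rbind(agg, s)                              # ... and keep only the summary
    }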

3 Answers
  • 2020-12-05 12:15

    I'm sort of surprised that this actually worked. Hopefully it works for your case, too. I'm quite curious how the speed compares to reading the compressed data from disk directly in R instead (albeit with a penalty for non-vectorization).

    library(data.table)

    # read just the first header line to recover the column names
    tblNames = fread('cat *dat.gz | gunzip | head -n 1')[, colnames(.SD)]
    # stream every file through one shell pipeline, dropping the repeated header rows
    tbl = fread('cat *dat.gz | gunzip | grep -v "^Day"')
    setnames(tbl, tblNames)
    tbl
    
  • 2020-12-05 12:20

    R has the ability to read gzipped files natively, using the gzfile function. See if this works.

    library(data.table)
    dat.files <- list.files(pattern = "\\.dat\\.gz$")  # assumes the files sit in the working directory
    rbindlist(lapply(dat.files, function(f) {
        read.delim(gzfile(f))  # gzfile() decompresses on the fly, no external process
    }))
    
  • 2020-12-05 12:24

    The bottleneck might be the system() call to an external application.

    You should try using the built-in functions to extract the archive. This answer explains how: Decompress gz file using R
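
    For reference, a minimal sketch of that built-in approach (read_one is just an illustrative helper name, and the file pattern is an assumption):

    read_one <- function(f) {
        con <- gzfile(f, "rt")  # built-in gzip connection, no external gunzip process
        on.exit(close(con))
        readLines(con)          # decompressed lines; parse with read.delim() etc. as needed
    }
    all.lines <- unlist(lapply(list.files(pattern = "\\.dat\\.gz$"), read_one))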
