Fastest way to read in 100,000 .dat.gz files

耶瑟儿~ 2020-12-05 11:45

I have a few hundred thousand very small .dat.gz files that I want to read into R as efficiently as possible. I read in each file and then immediately aggregate and discard the data, so I'm not worried about memory; I just want to speed up the unzipping and reading step.
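
For context, here is a minimal sketch of the kind of per-file read-and-aggregate loop described above (a rough illustration only: the Value column and the sum() aggregation are assumptions, and read.delim is assumed to match the file format):

    dat.files <- list.files(pattern = "\\.dat\\.gz$", full.names = TRUE)
    agg <- NULL
    for (f in dat.files) {
        d <- read.delim(gzfile(f))                        # decompress and parse one small file
        s <- aggregate(Value ~ Day, data = d, FUN = sum)  # aggregate immediately ...
        agg <- rbind(agg, s)                              # ... and keep only the summary
    }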

3 Answers
  • 2020-12-05 12:15

    I'm sort of surprised that this actually worked. Hopefully it works for your case, too. I'm quite curious how the speed compares to reading the compressed data from disk directly in R instead (albeit with a penalty for non-vectorization).

    library(data.table)

    # read just the first header line to recover the column names
    tblNames = fread('cat *dat.gz | gunzip | head -n 1')[, colnames(.SD)]
    # stream every file through one shell pipeline, dropping the repeated header rows
    tbl = fread('cat *dat.gz | gunzip | grep -v "^Day"')
    setnames(tbl, tblNames)
    tbl
    
  • 2020-12-05 12:20

    R has the ability to read gzipped files natively, using the gzfile function. See if this works.

    library(data.table)
    dat.files <- list.files(pattern = "\\.dat\\.gz$")  # assumes the files sit in the working directory
    rbindlist(lapply(dat.files, function(f) {
        read.delim(gzfile(f))  # gzfile() decompresses on the fly, no external process
    }))
    
  • 2020-12-05 12:24

    The bottleneck might be the system() call to an external application.

    You should try using the built-in functions to extract the archive. This answer explains how: Decompress gz file using R
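
    For reference, a minimal sketch of that built-in approach (read_one is just an illustrative helper name, and the file pattern is an assumption):

    read_one <- function(f) {
        con <- gzfile(f, "rt")  # built-in gzip connection, no external gunzip process
        on.exit(close(con))
        readLines(con)          # decompressed lines; parse with read.delim() etc. as needed
    }
    all.lines <- unlist(lapply(list.files(pattern = "\\.dat\\.gz$"), read_one))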
