R: Loops to process large dataset (GBs) in chunks?

孤街浪徒 2020-12-10 18:21

I have a large dataset (GBs) that I'd have to process before I can analyse it. I tried creating a connection, which allows me to loop through the large dataset and extract the data in chunks.

1 Answer
  • 2020-12-10 18:26

    Looks like you're on the right track. Just open the connection once (you don't need <<-, just <-), and use a larger chunk size so that R's vectorized operations can process each chunk efficiently, along the lines of

    filename <- "nameoffile.txt"
    nrows <- 1000000
    con <- file(description=filename, open="r")
    ## N.B.: skip = 17 comes from the original problem; usually not needed (thx @Moody_Mudskipper)
    data <- read.table(con, nrows=nrows, skip=17, header=FALSE)
    repeat {
        if (nrow(data) == 0)
            break
        ## process chunk 'data' here, then...
        ## ...read the next chunk
        if (nrow(data) != nrows)   # partial chunk, so it was the final one
            break
        data <- tryCatch({
            read.table(con, nrows=nrows, skip=0, header=FALSE)
        }, error=function(err) {
           ## matching on the condition message only works when the message is not translated
           if (identical(conditionMessage(err), "no lines available in input"))
              data.frame()
           else stop(err)
        })
    }
    close(con)
    

    Iteration seems to me like a good strategy, especially for a file that you're going to process once rather than, say, reference repeatedly like a database. The answer has been modified to be more robust about detecting a read at the end of the file.
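
    If you need this pattern more than once, the loop can be wrapped into a small helper that takes a per-chunk function and collects the results. This is only a sketch built on the loop above: the name process_in_chunks, its callback argument, and the example column sum are hypothetical, not part of the original answer.

    ## Hypothetical helper: apply FUN to each chunk and collect the results in a list.
    ## Uses the same end-of-file handling as the loop above.
    process_in_chunks <- function(filename, FUN, nrows = 1000000, skip = 0, ...) {
        con <- file(description = filename, open = "r")
        on.exit(close(con))                      # always close the connection
        results <- list()
        data <- read.table(con, nrows = nrows, skip = skip, header = FALSE, ...)
        repeat {
            if (nrow(data) == 0)
                break
            results[[length(results) + 1L]] <- FUN(data)
            if (nrow(data) != nrows)             # partial chunk => end of file
                break
            data <- tryCatch(
                read.table(con, nrows = nrows, header = FALSE, ...),
                error = function(err) {
                    if (identical(conditionMessage(err), "no lines available in input"))
                        data.frame()
                    else
                        stop(err)
                })
        }
        results
    }

    ## Example use: one sum per chunk (assumes the first column is numeric),
    ## reusing "nameoffile.txt" and skip = 17 from the answer above.
    chunk_sums <- process_in_chunks("nameoffile.txt",
                                    function(chunk) sum(chunk[[1]]),
                                    nrows = 1000000, skip = 17)
    total <- Reduce(`+`, chunk_sums, 0)

    Collecting one small result per chunk (a sum, a count, a filtered subset) keeps memory use bounded by the chunk size rather than by the file size.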
