R: Loops to process large dataset (GBs) in chunks?


It looks like you're on the right track. Open the connection once (there's no need for <<-, plain <- is fine) and use a reasonably large chunk size so that R's vectorized operations can process each chunk efficiently, along the lines of:

filename <- "nameoffile.txt"
nrows <- 1000000
con <- file(description = filename, open = "r")
## N.B.: skip = 17 comes from the original problem; usually not needed (thanks @Moody_Mudskipper)
data <- read.table(con, nrows = nrows, skip = 17, header = FALSE)
repeat {
    if (nrow(data) == 0)
        break
    ## process chunk 'data' here, then...
    ## ...read the next chunk
    if (nrow(data) != nrows)   # a short read means that was the final chunk
        break
    data <- tryCatch({
        read.table(con, nrows = nrows, skip = 0, header = FALSE)
    }, error = function(err) {
        ## matching the condition message only works when the message is not translated
        if (identical(conditionMessage(err), "no lines available in input"))
            data.frame()
        else stop(err)
    })
}
close(con)
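
One refinement worth considering (not part of the answer above, so treat the column types as an assumption about your data): if you know the column types up front, passing colClasses to read.table skips type inference on every chunk and keeps the columns typed consistently from chunk to chunk, e.g.

data <- read.table(con, nrows = nrows, header = FALSE,
                   colClasses = c("integer", "numeric", "character"))  # hypothetical types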

Iteration seems to me like a good strategy, especially for a file that you're going to process once rather than, say, reference repeatedly like a database. The answer has been modified to be more robust about detecting reads at the end of the file.
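
If you expect to reuse this chunked-read pattern, it can be wrapped in a small helper that takes a per-chunk callback. A minimal sketch, using the same end-of-input detection as above; the name process_in_chunks, its arguments, and the callback are hypothetical, not part of the answer:

process_in_chunks <- function(filename, nrows = 1e6, FUN, skip = 0) {
    con <- file(description = filename, open = "r")
    on.exit(close(con))                      # make sure the connection is closed
    repeat {
        data <- tryCatch(
            read.table(con, nrows = nrows, skip = skip, header = FALSE),
            error = function(err) {
                ## same end-of-input detection as above
                if (identical(conditionMessage(err), "no lines available in input"))
                    data.frame()
                else stop(err)
            })
        skip <- 0                            # only skip leading lines on the first read
        if (nrow(data) == 0)
            break
        FUN(data)                            # process this chunk
        if (nrow(data) != nrows)             # a short read means that was the final chunk
            break
    }
    invisible(NULL)
}

## usage (hypothetical): print a summary of each chunk
## process_in_chunks("nameoffile.txt", nrows = 1e6,
##                   FUN = function(chunk) print(summary(chunk)))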
