R: Loops to process large dataset (GBs) in chunks?

孤街浪徒 2020-12-10 18:21

I have a large dataset (GBs) that I'd have to process before I can analyse it. I tried creating a connection, which allows me to loop through the large dataset and extract the data in chunks.

1 Answer
  • 2020-12-10 18:26

    Looks like you're on the right track. Just open the connection once (you don't need <<-, just <-), and use a larger chunk size so that R's vectorized operations can process each chunk efficiently, along the lines of

    filename <- "nameoffile.txt"
    nrows <- 1000000
    con <- file(description=filename, open="r")
    ## N.B.: skip = 17 comes from the original problem; usually not needed (thx @Moody_Mudskipper)
    data <- read.table(con, nrows=nrows, skip=17, header=FALSE)
    repeat {
        if (nrow(data) == 0)
            break
        ## process chunk 'data' here, then...
        ## ...read the next chunk
        if (nrow(data) != nrows)   # partial chunk, so it was the final one
            break
        data <- tryCatch({
            read.table(con, nrows=nrows, skip=0, header=FALSE)
        }, error=function(err) {
           ## matching on the condition message only works when the message is not translated
           if (identical(conditionMessage(err), "no lines available in input"))
              data.frame()
           else stop(err)
        })
    }
    close(con)
    

    Iteration seems to me like a good strategy, especially for a file that you're going to process once rather than, say, reference repeatedly like a database. The answer has been modified to be more robust about detecting a read at the end of the file.
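
    If you need this pattern more than once, the loop can be wrapped into a small helper that takes a per-chunk function and collects the results. This is only a sketch built on the loop above: the name process_in_chunks, its callback argument, and the example column sum are hypothetical, not part of the original answer.

    ## Hypothetical helper: apply FUN to each chunk and collect the results in a list.
    ## Uses the same end-of-file handling as the loop above.
    process_in_chunks <- function(filename, FUN, nrows = 1000000, skip = 0, ...) {
        con <- file(description = filename, open = "r")
        on.exit(close(con))                      # always close the connection
        results <- list()
        data <- read.table(con, nrows = nrows, skip = skip, header = FALSE, ...)
        repeat {
            if (nrow(data) == 0)
                break
            results[[length(results) + 1L]] <- FUN(data)
            if (nrow(data) != nrows)             # partial chunk => end of file
                break
            data <- tryCatch(
                read.table(con, nrows = nrows, header = FALSE, ...),
                error = function(err) {
                    if (identical(conditionMessage(err), "no lines available in input"))
                        data.frame()
                    else
                        stop(err)
                })
        }
        results
    }

    ## Example use: one sum per chunk (assumes the first column is numeric),
    ## reusing "nameoffile.txt" and skip = 17 from the answer above.
    chunk_sums <- process_in_chunks("nameoffile.txt",
                                    function(chunk) sum(chunk[[1]]),
                                    nrows = 1000000, skip = 17)
    total <- Reduce(`+`, chunk_sums, 0)

    Collecting one small result per chunk (a sum, a count, a filtered subset) keeps memory use bounded by the chunk size rather than by the file size.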
