R reading a huge csv

慢半拍i  2020-12-23 23:12

I have a huge csv file. Its size is around 9 GB. I have 16 GB of RAM. I followed the advice from the page and implemented it below.

If you get the error

5 Answers

  •  眼角桃花  2020-12-23 23:36

    This may not be possible on your computer. In certain cases, data.table takes up more space than its .csv counterpart.

    library(data.table)

    DT <- data.table(x = sample(1:2, 10000000, replace = TRUE))
    write.csv(DT, "test.csv", row.names = FALSE)  # 29 MB file
    DT <- fread("test.csv")  # note: fread() has no row.names argument
    object.size(DT)
    > 40001072 bytes  # 40 MB
    

    Two orders of magnitude larger:

    DT <- data.table(x = sample(1:2, 1000000000, replace = TRUE))
    write.csv(DT, "test.csv", row.names = FALSE)  # 2.92 GB file
    DT <- fread("test.csv")
    object.size(DT)
    > 4000001072 bytes  # 4.00 GB
    

    There is natural overhead to storing an object in R. Based on these numbers, there is roughly a 1.33x expansion factor (R : csv) when reading files. However, this varies with the data; see the sketch after this list. For example, using

    • x = sample(1:10000000, 10000000, replace = TRUE) gives a factor of roughly 2x (R : csv).

    • x = sample(c("foofoofoo","barbarbar"), 10000000, replace = TRUE) gives a factor of roughly 0.5x (R : csv).
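
    If you want to know the expansion factor for your own data before committing to a full read, one option is to measure it on a sample. The sketch below is illustrative rather than part of the original answer: ratio_for() is a hypothetical helper, and it uses fwrite(), which never writes row names, so the ratios will differ slightly from the write.csv() numbers above.

    library(data.table)

    # in-memory size of a one-column data.table divided by its on-disk csv size
    ratio_for <- function(x, path = tempfile(fileext = ".csv")) {
      DT <- data.table(x = x)
      fwrite(DT, path)
      as.numeric(object.size(DT)) / file.size(path)
    }

    ratio_for(sample(1:2, 1e6, replace = TRUE))                          # small integers
    ratio_for(sample(1:10000000, 1e6, replace = TRUE))                   # wider integers
    ratio_for(sample(c("foofoofoo", "barbarbar"), 1e6, replace = TRUE))  # longer strings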

    Based on the worst case above, your 9 GB file could take 18 GB of memory to store in R, if not more. Based on your error message, it is far more likely that you are hitting hard memory constraints than an allocation issue. Therefore, just reading your file in chunks and consolidating would not work - you would also need to partition your analysis and workflow. Another alternative is to keep the data out of memory and use an on-disk tool such as a SQL database, as sketched below.
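
    A minimal sketch of that route, assuming the data.table, DBI, and RSQLite packages: it loads the csv into an on-disk SQLite file in chunks, so only one chunk is ever held in RAM, and then runs the heavy aggregation inside the database. The file names ("huge.csv", "big.sqlite"), the table name, and the chunk size are placeholders, not from the question.

    library(data.table)
    library(DBI)
    library(RSQLite)

    con <- dbConnect(RSQLite::SQLite(), "big.sqlite")

    header     <- names(fread("huge.csv", nrows = 0))  # read only the column names
    chunk_rows <- 1e6                                  # tune so one chunk fits comfortably in RAM
    offset     <- 0

    repeat {
      chunk <- tryCatch(
        fread("huge.csv",
              skip      = offset + 1,   # +1 skips the header line
              nrows     = chunk_rows,
              header    = FALSE,
              col.names = header),
        error = function(e) data.table()  # skipping past end of file -> treat as empty chunk
      )
      if (nrow(chunk) == 0) break

      # append the chunk to an on-disk table instead of keeping it in R's memory
      dbWriteTable(con, "big_table", chunk,
                   append = dbExistsTable(con, "big_table"))
      offset <- offset + nrow(chunk)
    }

    # the aggregation runs inside SQLite; only the small result comes back into R
    res <- dbGetQuery(con, "SELECT COUNT(*) AS n FROM big_table")
    dbDisconnect(con)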
