I have a huge CSV file, around 9 GB in size, and my machine has 16 GB of RAM. I followed the advice from the page and implemented it below, but I still get an error.
This may not be possible on your computer. In certain cases, a data.table takes up more space in memory than its .csv counterpart does on disk.
library(data.table)

DT <- data.table(x = sample(1:2, 10000000, replace = TRUE))
write.csv(DT, "test.csv", row.names = FALSE) # 29 MB file
DT <- fread("test.csv")
object.size(DT)
> 40001072 bytes # 40 MB
Two orders of magnitude larger:
DT <- data.table(x = sample(1:2, 1000000000, replace = TRUE))
write.csv(DT, "test.csv", row.names = FALSE) # 2.92 GB file
DT <- fread("test.csv")
object.size(DT)
> 4000001072 bytes # 4.00 GB
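To see that overhead directly rather than eyeballing the byte counts, you can divide the in-memory size by the on-disk size (a small sketch, assuming the test.csv written above is still on disk):

as.numeric(object.size(DT)) / file.size("test.csv") # roughly the 1.3x factor discussed next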
There is natural overhead to storing an object in R. Based on these numbers, reading a file costs roughly a 1.33x factor (R:csv). However, the factor varies with the data. For example:

x = sample(1:10000000, 10000000, replace = TRUE) gives a factor of roughly 2x (R:csv).
x = sample(c("foofoofoo","barbarbar"), 10000000, replace = TRUE) gives a factor of roughly 0.5x (R:csv).
Based on the worst case, your 9 GB file could take 18 GB or more of memory to store in R. Judging by your error message, it is far more likely that you are hitting a hard memory limit than an allocation issue. Therefore, just reading your file in chunks and consolidating would not work on its own - you would also need to partition your analysis and workflow, as sketched below. Another alternative is to push the heavy lifting out of memory entirely, for example into a SQL database.
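If you do read in chunks, the key is to keep only a per-chunk summary in memory and never the full table. Here is a minimal sketch using fread's skip/nrows arguments, where big.csv, n_total, chunk_rows and the grouping column x are placeholders for your own file and analysis:

library(data.table)

chunk_rows <- 1e6                                 # placeholder chunk size
n_total    <- 1e9                                 # placeholder: data rows, e.g. wc -l minus the header
header     <- names(fread("big.csv", nrows = 0))  # read the column names only
pieces     <- list()

for (skip in seq(1, n_total, by = chunk_rows)) {  # skip the header plus all rows already read
  chunk <- fread("big.csv", skip = skip, nrows = chunk_rows,
                 header = FALSE, col.names = header)
  pieces[[length(pieces) + 1]] <- chunk[, .N, by = x]  # keep a summary, drop the raw rows
}

counts <- rbindlist(pieces)[, .(N = sum(N)), by = x]   # consolidate the per-chunk summaries

Note that each call re-scans the lines it skips, so this is slow even though memory use stays flat; a database does the same job more efficiently.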