What is the fastest way and fastest format for loading large data sets into R [duplicate]

风流意气都作罢 提交于 2020-06-25 07:02:34

问题


I have a large dataset (about 13GB uncompressed) and I need to load it repeatedly. The first load (and save to a different format) can be very slow but every load after this should be as fast as possible. What is the fastest way and fastest format from which to load a data set?

My suspicion is that the optimal choice is something like

 saveRDS(obj, file = 'bigdata.Rda', compress = FALSE)
 obj <- loadRDS('bigdata.Rda)

But this seems slower than using fread function in the data.table package. This should not be the case because fread converts a file from CSV (although it is admittedly highly optimized).

Some timings for a ~800MB dataset are:

> system.time(tmp <- fread("data.csv"))
Read 6135344 rows and 22 (of 22) columns from 0.795 GB file in 00:00:43
     user  system elapsed 
     36.94    0.44   42.71 
 saveRDS(tmp, file = 'tmp.Rda'))
> system.time(tmp <- readRDS('tmp.Rda'))
     user  system elapsed 
     69.96    2.02   84.04

Previous Questions

This question is related but does not reflect the current state of R, for example an answer suggests reading from a binary format will always be faster than a text format. The suggestion to use *SQL is also not helpful in my case as the entire data set is required, not just a subset of it.

There are also related questions about the fastest way of loading data once (eg: 1).


回答1:


It depends on what you plan on doing with the data. If you want the entire data in memory for some operation then I guess your best bet is fread or readRDS (the file size for a data saved in RDS is much much smaller if that matters to you).

If you will be doing summary operations on the data I have found one time conversion to a database (using sqldf) a much better option, as subsequent operations are much more faster by executing sql queries on the data, but that is also because I don't have enough RAM to load 13 GB files in memory.



来源:https://stackoverflow.com/questions/31887230/what-is-the-fastest-way-and-fastest-format-for-loading-large-data-sets-into-r

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!