Using R with tidyquant and massive data

与世无争的帅哥 submitted on 2019-12-09 21:41:40

Question


While working with R I encountered a strange problem. I am processing data in the following manner: reading data from a database into a data frame, filling missing values, grouping and nesting the data by a combined primary key, creating a time series and forecasting it for every group, ungrouping and cleaning the data, and writing it back into the DB.

Something like this: https://cran.rstudio.com/web/packages/sweep/vignettes/SW01_Forecasting_Time_Series_Groups.html

For small data sets this works like a charm, but with larger ones (over about 100,000 entries) I get the "R Session Aborted" screen from RStudio, and the native R GUI just stops execution and crashes. There is no information in any of the log files I've looked into. I suspect that it is some kind of memory issue, possibly a leak.

As a workaround I'm processing the data in chunks with a for-loop. But no matter how small the chunk size is, I still get the "R Session Aborted" screen, which looks a lot like leaking memory. The whole data set consists of about 5 million rows.
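
A minimal sketch of the chunked approach described above, using only base R. The data frame `df`, the column names, the chunk size, and `process_chunk()` are hypothetical stand-ins for the real data and the forecasting step. Logging the output of `gc()` after each chunk is one way to see whether memory use keeps growing across iterations, which might help tell a leak apart from a single oversized group.

```r
# Hypothetical example data: 10 groups with 12 observations each.
set.seed(1)
df <- data.frame(
  key   = rep(sprintf("g%02d", 1:10), each = 12),
  value = rnorm(120)
)

# Stand-in for the real per-group work (the forecast in the question);
# here it only computes the mean of each group.
process_chunk <- function(chunk) {
  aggregate(value ~ key, data = chunk, FUN = mean)
}

# Split the group keys (not the raw rows) into chunks so that a group
# is never cut in half.
keys       <- unique(df$key)
chunk_size <- 3
key_chunks <- split(keys, ceiling(seq_along(keys) / chunk_size))

results <- vector("list", length(key_chunks))
for (i in seq_along(key_chunks)) {
  chunk        <- df[df$key %in% key_chunks[[i]], ]
  results[[i]] <- process_chunk(chunk)
  rm(chunk)
  mem <- gc()  # force a garbage collection and capture memory statistics
  # Column 2 of the gc() matrix holds the memory currently in use, in MB.
  message(sprintf("chunk %d: %.1f MB in use", i, sum(mem[, 2])))
}

# Combine the per-chunk results into one data frame.
out <- do.call(rbind, results)
```

If the reported memory climbs steadily even though each chunk is small, that points towards something retaining objects between iterations rather than towards the size of any one chunk.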

I've looked a lot into packages like ff, the big-family, and matter, basically everything from https://cran.r-project.org/web/views/HighPerformanceComputing.html, but these do not seem to work well with tibbles and the tidyverse way of data processing.

So, how can I improve my script to work with massive amounts of data? How can I gather clues about why the R session is aborted?


Answer 1:


Check out the article at:

datascience.la/dplyr-and-a-very-basic-benchmark

There is a table that shows runtime comparisons for some of the data wrangling tasks you are performing. From the table, it looks as though dplyr with a data.table behind it is likely to do much better than dplyr with a data frame behind it.

There’s a link to the benchmarking code used to make the table, too.

In short, try adding a key, and try using a data.table instead of a data frame.

To make x your key, assuming your data.table is named dt, use setkey(dt, x).
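
As a small illustration of the `setkey()` call above, here is a sketch that assumes the data.table package is installed. The table `dt` and its columns `x` and `y` are made-up example data, not the asker's schema.

```r
library(data.table)

# Hypothetical data: a grouping column x and a numeric column y.
dt <- data.table(
  x = c("b", "a", "c", "a", "b"),
  y = c(2, 1, 5, 3, 4)
)

# Set x as the key; this sorts dt by x by reference (no copy is made).
setkey(dt, x)
key(dt)

# Keyed subset: rows where x == "a", looked up via the key.
rows_a <- dt[.("a")]

# Grouped aggregation by the key column.
sums <- dt[, .(total = sum(y)), by = x]
```

Because `setkey()` reorders the table in place, it avoids the extra copy that sorting a regular data frame would create, which may matter when memory is the bottleneck.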




Answer 2:


While Pake's answer deals with the described problem, I found a solution to the underlying problem. For compatibility reasons I had been using R version 3.4.3. I am now using the newer version 3.5.1, which works fine.



Source: https://stackoverflow.com/questions/51517395/using-r-with-tidyquant-and-massiv-data
