reshape alternating columns in less time and using less memory


I doubt very much that this will succeed in that small amount of RAM with a 500000 x 500 dataframe; I wonder whether you could do even simple operations in that limited space. Buy more RAM. Beyond that, reshape2 is slow, so use stats::reshape for big jobs, and give it a hint about what the separator in the column names is.
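Since make_example() comes from the question and isn't shown here, below is a minimal stand-in (an assumption, not the asker's actual function) that builds the same shape of data: n rows with docnum, filename, and k interleaved ntop_i/ptop_i column pairs, so the transcript can be reproduced.

# Hypothetical stand-in for the asker's make_example(): n rows and
# k interleaved (ntop_i, ptop_i) column pairs, as in the data shown below
make_example <- function(n, k) {
  dat <- data.frame(
    docnum   = seq_len(n),
    filename = replicate(n, paste(sample(c(letters, LETTERS, 0:9), 5,
                                         replace = TRUE), collapse = "")),
    stringsAsFactors = FALSE
  )
  for (i in seq_len(k)) {
    dat[[paste0("ntop_", i)]] <- sample(1:3, n, replace = TRUE)  # topic id
    dat[[paste0("ptop_", i)]] <- runif(n)                        # proportion
  }
  dat
}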

> set.seed(007)
> dat <- make_example(5, 3)
> dat
  docnum filename ntop_1     ptop_1 ntop_2    ptop_2 ntop_3    ptop_3
1      1    y8214      3 0.06564574      1 0.6799935      2 0.8470244
2      2    e6x39      2 0.62703876      1 0.2637199      3 0.4980761
3      3    34c19      3 0.49047504      3 0.1857143      3 0.7905856
4      4    1H0y6      2 0.97102441      3 0.1851432      2 0.8384639
5      5    P6zqy      3 0.36222085      3 0.3792967      3 0.4569039

> reshape(dat, direction="long", varying=3:8, sep="_")
    docnum filename time ntop       ptop id
1.1      1    y8214    1    3 0.06564574  1
2.1      2    e6x39    1    2 0.62703876  2
3.1      3    34c19    1    3 0.49047504  3
4.1      4    1H0y6    1    2 0.97102441  4
5.1      5    P6zqy    1    3 0.36222085  5
1.2      1    y8214    2    1 0.67999346  1
2.2      2    e6x39    2    1 0.26371993  2
3.2      3    34c19    2    3 0.18571426  3
4.2      4    1H0y6    2    3 0.18514322  4
5.2      5    P6zqy    2    3 0.37929675  5
1.3      1    y8214    3    2 0.84702439  1
2.3      2    e6x39    3    3 0.49807613  2
3.3      3    34c19    3    3 0.79058557  3
4.3      4    1H0y6    3    2 0.83846387  4
5.3      5    P6zqy    3    3 0.45690386  5
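Here varying=3:8 points reshape at the six ntop_*/ptop_* columns, and sep="_" is the hint that lets it split each name into a stub (ntop, ptop) and a time index (1, 2, 3). If you would rather not count column positions, selecting the varying columns by name works the same way (a sketch of an equivalent call):

# Equivalent call selecting the varying columns by name instead of position;
# grep() returns the column indices whose names match the ntop_/ptop_ pattern
reshape(dat, direction = "long",
        varying = grep("^(ntop|ptop)_", names(dat)),
        sep = "_")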

> system.time( dat <- make_example(5000,100) )
   user  system elapsed 
  2.925   0.131   3.043 
> system.time( dat2 <-  reshape(dat, direction="long", varying=3:202, sep="_"))
   user  system elapsed 
 16.766   8.608  25.272 

I'd say roughly a fifth of my 32 GB of memory was used during that run, on a problem about 250 times smaller than your goal, so I'm not surprised that your machine hung. (It should not have "crashed", though. The authors of R would prefer that you give accurate descriptions of behavior, and I suspect the R process simply stopped responding once it started paging into virtual memory.) Even with 32 GB, I have to work around performance issues with a dataset of 7 million records x 100 columns.
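If you want a rough picture of the memory involved, base R can report it (a sketch; these calls were not part of the timing run above and the numbers will vary by platform):

# Size of the reshaped result, and the session's peak memory use
print(object.size(dat2), units = "MB")
gc()  # the "max used" columns show the session's high-water mark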
