How to use COGROUP for large datasets

自闭症患者 2020-11-28 16:40

I have two RDDs, val tab_a: RDD[(String, String)] and val tab_b: RDD[(String, String)], and I'm using cogroup on them.
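For reference, the pattern being asked about presumably looks something like the sketch below. This is an assumption pieced together from the question and the answer: the names tab_a and tab_b come from the question, the sample data is made up, and the collect().toArray step is inferred from the answer that follows.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.rdd.RDD

object CogroupSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cogroup-sketch")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Two pair RDDs keyed by String, standing in for the question's tab_a and tab_b.
    val tab_a: RDD[(String, String)] = sc.parallelize(Seq(("k1", "a1"), ("k2", "a2")))
    val tab_b: RDD[(String, String)] = sc.parallelize(Seq(("k1", "b1"), ("k3", "b3")))

    // cogroup yields, per key, the values from both RDDs:
    // RDD[(String, (Iterable[String], Iterable[String]))]
    val grouped = tab_a.cogroup(tab_b)

    // The step the answer below warns about: materializing everything on the driver.
    val onDriver = grouped.collect().toArray

    spark.stop()
  }
}
```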

2 Answers
  •  野性不改
    2020-11-28 17:24

    TL;DR Don't collect.

    To run this code safely without additional assumptions, every node (the driver and each executor) would need memory significantly exceeding the total memory required by all the data. (On average, the requirements for the worker nodes could be significantly smaller.)

    If you were to run this outside Spark, you would need only one node, so Spark provides no benefit here.

    However, if you skip collect().toArray and make some assumptions about the data distribution, it might run just fine; see the sketch below.
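    A hedged sketch of that alternative, reusing tab_a and tab_b from the sketch above: keep the cogrouped result distributed and write it out with an RDD action instead of collecting. The inner-join-style expansion and the output path are illustrative assumptions, since the question's downstream logic is cut off.

```scala
// Instead of grouped.collect().toArray, keep processing distributed.
// flatMap over the cogrouped values, pairing every value from tab_a with
// every value from tab_b for the same key (an inner-join-like expansion,
// chosen purely as an illustration).
val joined: RDD[(String, (String, String))] =
  tab_a.cogroup(tab_b).flatMap { case (key, (as, bs)) =>
    for (a <- as; b <- bs) yield (key, (a, b))
  }

// Writing from the executors avoids funnelling all data through the driver.
// The path is a placeholder.
joined.saveAsTextFile("hdfs:///tmp/cogroup-output")
```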
