How to use COGROUP for large datasets

自闭症患者 2020-11-28 16:40

I have two RDDs, namely val tab_a: RDD[(String, String)] and val tab_b: RDD[(String, String)]. I'm using cogroup for tho…

2 Answers
  •  鱼传尺愫
    2020-11-28 17:20

    When you use collect() you are basically telling Spark to move all of the resulting data back to the master node, which can easily become a bottleneck. At that point you are no longer using Spark, just a plain array on a single machine.
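    As a minimal sketch of that anti-pattern, assuming the tab_a and tab_b pair RDDs from the question (note that the cogroup call itself is lazy; it is collect() that forces everything onto the driver):

    ```scala
    // Lazy: builds an RDD[(String, (Iterable[String], Iterable[String]))],
    // still partitioned across the cluster.
    val grouped = tab_a.cogroup(tab_b)

    // Eager, and the problem: ships every grouped record back to the driver JVM.
    // With big data this will exhaust the driver's heap.
    val everything = grouped.collect()
    ```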

    To trigger computation, use an action that requires the data at every node; that's why executors live on top of a distributed file system. For instance, saveAsTextFile().

    Here are some basic examples.
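    For instance, here is a self-contained sketch of keeping the result distributed end to end. The context setup, sample data, and output path are all illustrative assumptions, not from the question:

    ```scala
    import org.apache.spark.{SparkConf, SparkContext}

    object CogroupExample {
      def main(args: Array[String]): Unit = {
        // Hypothetical local context just so the sketch runs standalone.
        val sc = new SparkContext(
          new SparkConf().setAppName("cogroup-example").setMaster("local[*]"))

        // Illustrative stand-ins for the question's two pair RDDs.
        val tab_a = sc.parallelize(Seq(("k1", "a1"), ("k2", "a2")))
        val tab_b = sc.parallelize(Seq(("k1", "b1"), ("k3", "b3")))

        // cogroup keeps the grouped data partitioned across the executors...
        val grouped = tab_a.cogroup(tab_b)

        // ...and saveAsTextFile writes each partition from its own executor,
        // so nothing is funneled through the driver.
        grouped.saveAsTextFile("/tmp/cogroup-output")  // hypothetical path

        sc.stop()
      }
    }
    ```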

    Remember: the entire objective, at least when you have big data, is to move the code to your data and compute there, not to bring all the data to the computation.
