Why is my BroadcastHashJoin slower than ShuffledHashJoin in Spark?

I execute a join using a javaHiveContext in Spark.

The big table is 1.76 GB and has 100 million records.

The second table is 273Mb and has 10 mil

1 Answer

闹比i

2020-12-02 01:21

Most likely the source of the problem is the cost of broadcasting. To keep things simple, let's assume that you have 1800 MB in the larger RDD and 300 MB in the smaller one. Assuming 5 executors and no previous partitioning, a fifth of all data should already be on the correct machine. That leaves ~1700 MB to shuffle in the case of a standard join.
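
The shuffle estimate above can be reproduced with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the function name `shuffle_volume_mb` is made up for illustration, and it assumes data is spread evenly so that exactly 1/N of it already sits on the right executor.

```python
def shuffle_volume_mb(large_mb, small_mb, executors):
    """Rough amount of data (MB) that crosses the network in a shuffle join.

    Assumption: with no prior partitioning and an even spread,
    1/executors of all rows are already on the correct machine.
    """
    total_mb = large_mb + small_mb
    return total_mb * (executors - 1) / executors


# Numbers from the answer: 1800 MB + 300 MB across 5 executors.
print(shuffle_volume_mb(1800, 300, 5))  # 1680.0, i.e. roughly 1700 MB
```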

For a broadcast join, the smaller RDD has to be transferred to all nodes, which means around 1500 MB of data to transfer. If you add the required communication with the driver, you have to move a comparable amount of data in a much more expensive way. Broadcast data has to be collected first, and only after that can it be forwarded to all the workers.
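
The broadcast side of the comparison can be sketched the same way. The helper name `broadcast_volume_mb` is hypothetical, and treating the collect step as exactly one extra copy of the small table sent to the driver is a simplifying assumption, not a measurement of what Spark actually does.

```python
def broadcast_volume_mb(small_mb, executors, include_driver_collect=True):
    """Rough amount of data (MB) moved for a broadcast join.

    Assumptions: every executor receives a full copy of the small table,
    and collecting it on the driver first costs one additional copy.
    """
    to_executors_mb = small_mb * executors
    to_driver_mb = small_mb if include_driver_collect else 0
    return to_executors_mb + to_driver_mb


# Numbers from the answer: a 300 MB table broadcast to 5 executors.
print(broadcast_volume_mb(300, 5, include_driver_collect=False))  # 1500
print(broadcast_volume_mb(300, 5))                                # 1800
```

Under these assumptions the broadcast moves 1500 to 1800 MB, which is in the same range as the ~1700 MB shuffle, and all of it first funnels through the driver.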
