I'm trying to perform a broadcast hash join on dataframes using SparkSQL as documented here: https://docs.cloud.databricks.com/docs/latest/databricks_guide/06%20Spark%20SQL
join_rdd = sqlContext.sql("""select * from people_in_india p
                             join states s
                             on p.state = s.name""")

join_rdd.toDebugString() / join_rdd.explain() show a ShuffledHashJoin:
all the data for India gets shuffled into only 29 keys, one for each state.
Problems:
Uneven sharding, since the number of rows per state is heavily skewed.
Limited parallelism, with only 29 output partitions.
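To make the parallelism limit concrete, here is a plain-Python sketch (not Spark internals, and the data is made up) of why a shuffled hash join on a low-cardinality key caps parallelism: each row is routed to a reduce partition by hash(join_key) % num_partitions, so no more partitions than distinct keys can ever receive data.

```python
from collections import defaultdict

def shuffle_by_key(rows, key_index, num_partitions):
    """Route each row to a partition by hashing its join key."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[hash(row[key_index]) % num_partitions].append(row)
    return partitions

# 10,000 people spread over only 29 states (names are invented):
states = ["state_%d" % i for i in range(29)]
people = [("person_%d" % i, states[i % 29]) for i in range(10000)]

parts = shuffle_by_key(people, 1, 200)  # ask for 200 reduce partitions
# At most 29 partitions can be non-empty (one per distinct state),
# so most of the requested 200 partitions sit idle during the join.
print(len(parts))  # <= 29
```

Requesting more partitions does not help: the hash of 29 distinct keys can land in at most 29 distinct buckets.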
broadcastHashJoin:
Broadcast the small RDD to all worker nodes. The parallelism of the large RDD is maintained, and no shuffle is required at all.
PS: The image may be ugly, but it is informative.