Spark SQL broadcast hash join

情书的邮戳 2020-12-01 06:50

I'm trying to perform a broadcast hash join on DataFrames using Spark SQL, as documented here: https://docs.cloud.databricks.com/docs/latest/databricks_guide/06%20Spark%20SQL
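For context, one common way to request a broadcast hash join from the DataFrame API is to mark the smaller side with the broadcast() hint from pyspark.sql.functions. Below is a minimal, self-contained sketch of that pattern; the toy people/states data and column names are my own illustration, not code from the linked guide:

    from pyspark import SparkContext
    from pyspark.sql import SQLContext
    from pyspark.sql.functions import broadcast

    sc = SparkContext(appName="broadcast-join-sketch")
    sqlContext = SQLContext(sc)

    # A large "fact" table and a small "dimension" table (toy data for illustration).
    people = sqlContext.createDataFrame(
        [("Alice", "Karnataka"), ("Bob", "Kerala"), ("Chen", "Karnataka")],
        ["person", "state"])
    states = sqlContext.createDataFrame(
        [("Karnataka", "KA"), ("Kerala", "KL")],
        ["name", "code"])

    # Marking the smaller DataFrame with broadcast() asks the planner to ship it to
    # every executor and use a broadcast hash join instead of shuffling both sides.
    joined = people.join(broadcast(states), people["state"] == states["name"])

    joined.explain()  # the physical plan should mention BroadcastHashJoin

Tables smaller than spark.sql.autoBroadcastJoinThreshold (10 MB by default) may also be broadcast automatically, without an explicit hint.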

3 Answers
  •  感情败类
    2020-12-01 07:23

    join_rdd = sqlContext.sql("""
        SELECT *
        FROM people_in_india p
        JOIN states s
        ON p.state = s.name
    """)

    Inspecting the plan with join_rdd.explain() (or the underlying RDD's toDebugString()) shows one of two strategies:
    

    ShuffledHashJoin:
    all the data for India will be shuffled into only 29 keys, one for each state. Problems: uneven sharding, and parallelism limited to 29 output partitions.

    BroadcastHashJoin:

    the small RDD is broadcast to all worker nodes; the parallelism of the large RDD is maintained, and no shuffle is required at all.

    PS: The image may be ugly, but it is informative.
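    To check which strategy the planner actually chose for the SQL above, print the physical plan and, if needed, tune the automatic broadcast threshold. A minimal sketch, assuming an existing sqlContext with people_in_india and states already registered as temporary tables (the threshold value is only illustrative):

    # spark.sql.autoBroadcastJoinThreshold is the max size (in bytes) of a table that
    # Spark will broadcast automatically; setting it to -1 disables automatic broadcasting.
    sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", str(10 * 1024 * 1024))

    join_rdd = sqlContext.sql("""
        SELECT *
        FROM people_in_india p
        JOIN states s
        ON p.state = s.name
    """)

    # If `states` fits under the threshold, the physical plan shows BroadcastHashJoin
    # and only the large table keeps its original partitioning; otherwise Spark falls
    # back to a shuffle-based join.
    join_rdd.explain()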
