I'm trying to perform a broadcast hash join on dataframes using SparkSQL as documented here: https://docs.cloud.databricks.com/docs/latest/databricks_guide/06%20Spark%20SQL
join_rdd = sqlContext.sql("""select * from people_in_india p
                             join states s
                             on p.state = s.name""")

join_rdd.toDebugString() / join_rdd.explain() show a ShuffledHashJoin:
all the data for India gets shuffled into only 29 keys, one for each state.
Problems:
Uneven sharding, since the number of rows per state is heavily skewed.
Limited parallelism, with only 29 output partitions.
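To make the parallelism limit concrete, here is a plain-Python sketch (not Spark internals, and the data is made up) of why a shuffled hash join on a low-cardinality key caps parallelism: each row is routed to a reduce partition by hash(join_key) % num_partitions, so no more partitions than distinct keys can ever receive data.

```python
from collections import defaultdict

def shuffle_by_key(rows, key_index, num_partitions):
    """Route each row to a partition by hashing its join key."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[hash(row[key_index]) % num_partitions].append(row)
    return partitions

# 10,000 people spread over only 29 states (names are invented):
states = ["state_%d" % i for i in range(29)]
people = [("person_%d" % i, states[i % 29]) for i in range(10000)]

parts = shuffle_by_key(people, 1, 200)  # ask for 200 reduce partitions
# At most 29 partitions can be non-empty (one per distinct state),
# so most of the requested 200 partitions sit idle during the join.
print(len(parts))  # <= 29
```

Requesting more partitions does not help: the hash of 29 distinct keys can land in at most 29 distinct buckets.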
broadcastHashJoin:
Broadcast the small RDD to all worker nodes. The parallelism of the large RDD is maintained, and no shuffle is required at all.
PS: The image may be ugly, but it is informative.