I'm trying to perform a broadcast hash join on DataFrames using Spark SQL as documented here: https://docs.cloud.databricks.com/docs/latest/databricks_guide/06%20Spark%20SQL
With a broadcast join, one side of the join equation is materialized and sent to all mappers. It is therefore considered a map-side join.
Because that data set is materialized and sent over the network, a broadcast join only brings a significant performance improvement if the data set is fairly small.
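To make the map-side idea concrete, here is a minimal sketch in plain Python, not Spark code. It assumes rows are dicts and partitions are plain lists; the function name `broadcast_hash_join` and the sample data are made up for illustration. The small side is turned into a hash table once, and every partition of the large side only probes that table locally.

```python
from collections import defaultdict


def broadcast_hash_join(small_rows, large_partitions, key):
    """Join every partition of the large side against a hash table
    built from the small side (hypothetical helper, not a Spark API)."""
    # Materialize the small side once as a hash table: key -> list of rows.
    table = defaultdict(list)
    for row in small_rows:
        table[row[key]].append(row)

    joined = []
    # Each "mapper" handles its own partition and only probes the table,
    # so the rows of the large side never have to be shuffled.
    for partition in large_partitions:
        for row in partition:
            for match in table.get(row[key], []):
                joined.append({**match, **row})
    return joined


# Made-up sample data: a small lookup table and a large side in two partitions.
small = [{"id": 1, "country": "DE"}, {"id": 2, "country": "FR"}]
large = [
    [{"id": 1, "amount": 10}, {"id": 3, "amount": 7}],
    [{"id": 2, "amount": 5}, {"id": 1, "amount": 2}],
]
result = broadcast_hash_join(small, large, "id")
```

The row with `id` 3 has no partner on the small side, so it drops out, as in an inner join. In Spark itself you would not write this by hand; the sketch only shows why the small side has to be shipped to every mapper.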
So if you are trying to perform smallDF.join(largeDF)
Wait! Another constraint is that it also needs to fit completely into the memory of each executor. It also needs to fit into the memory of the driver!
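The size constraints can be written down as a simple check. This is only a sketch under my own assumptions, not Spark's actual planner logic: `can_broadcast` and all byte figures are hypothetical. The `threshold_bytes` parameter is modelled on the Spark SQL setting `spark.sql.autoBroadcastJoinThreshold`, where a value of -1 turns automatic broadcasting off.

```python
MB = 1024 * 1024


def can_broadcast(table_bytes, threshold_bytes, driver_free_bytes, executor_free_bytes):
    """Return True if the small side is a reasonable broadcast candidate
    (hypothetical helper for illustration only)."""
    # A negative threshold means broadcasting is switched off.
    if threshold_bytes < 0:
        return False
    return (
        table_bytes <= threshold_bytes
        # The driver has to hold the materialized table first ...
        and table_bytes <= driver_free_bytes
        # ... and afterwards every single executor needs a full copy.
        and all(table_bytes <= free for free in executor_free_bytes)
    )


# Made-up numbers: an 8 MB table and a 10 MB threshold.
fits = can_broadcast(8 * MB, 10 * MB, 512 * MB, [256 * MB, 256 * MB])
one_executor_too_small = can_broadcast(8 * MB, 10 * MB, 512 * MB, [256 * MB, 4 * MB])
disabled = can_broadcast(8 * MB, -1, 512 * MB, [256 * MB, 256 * MB])
```

Note that one executor that is short on memory is enough to rule the broadcast out, because each executor gets the whole table, not a slice of it.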
Broadcast variables are shared among executors using the Torrent protocol, i.e. a peer-to-peer protocol. The advantage of the Torrent protocol is that peers share blocks of a file among each other instead of relying on a central entity holding all the blocks.
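A toy simulation shows why sharing blocks between peers takes load off the central holder. It is a simplified model under my own assumptions, not how Spark schedules transfers: executors fetch one after another, and `distribute` as well as the node names are made up. Each executor asks another executor for a block if one already has it, and only falls back to the driver otherwise.

```python
def distribute(num_blocks, executors):
    """Simulate peer-to-peer block sharing (hypothetical model, not Spark internals)."""
    # At the start only the driver holds every block.
    holders = {"driver": set(range(num_blocks))}
    for name in executors:
        holders[name] = set()
    served_by = {name: 0 for name in holders}

    for name in executors:
        for block in range(num_blocks):
            # Peers that already hold this block can serve it.
            peers = [p for p in executors if p != name and block in holders[p]]
            # Pick the least busy peer; fall back to the driver if nobody has it.
            source = min(peers, key=lambda p: served_by[p]) if peers else "driver"
            served_by[source] += 1
            holders[name].add(block)
    return holders, served_by


holders, served_by = distribute(4, ["e1", "e2", "e3"])
```

In this run the driver hands out each of the 4 blocks only once, although 12 block transfers happen in total. Without peer sharing the driver would have to serve all 12.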
The example mentioned above is enough to start playing with broadcast joins.
Note: A broadcast variable's value cannot be modified after creation. If you try, the change will only be visible on one node.
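The reason is that every node works on its own copy of the value. A small plain-Python sketch of that effect, with made-up node names and `copy.deepcopy` standing in for shipping the value over the network:

```python
import copy

# The value as created on the driver.
driver_value = {"DE": "Germany"}

# Each node receives its own independent copy (simulated with deepcopy).
node_copies = {node: copy.deepcopy(driver_value) for node in ("node-1", "node-2")}

# A task on node-1 modifies "the" broadcast value ...
node_copies["node-1"]["FR"] = "France"

# ... but only node-1's copy has changed.
changed_on_node_1 = "FR" in node_copies["node-1"]
changed_on_node_2 = "FR" in node_copies["node-2"]
changed_on_driver = "FR" in driver_value
```

So treat the value as read-only. If the data has to change, create a new broadcast variable from the updated value.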