Spark SQL broadcast hash join

情书的邮戳 2020-12-01 06:50

I'm trying to perform a broadcast hash join on dataframes using SparkSQL as documented here: https://docs.cloud.databricks.com/docs/latest/databricks_guide/06%20Spark%20SQL

3 Answers

暖寄归人 (original poster)

2020-12-01 07:38

With a broadcast join, one side of the join is materialized and sent to all mappers. It is therefore considered a map-side join.

Because that data set is materialized and sent over the network, the broadcast only brings a significant performance improvement if the data set is considerably small.
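To make the map-side idea concrete, here is a minimal pure-Python sketch. It is not Spark code: `broadcast_hash_join`, the row dictionaries, and the list-of-lists "partitions" are all made up for illustration. It only shows the two phases, building a hash table from the small side and probing it locally in each partition of the large side.

```python
from collections import defaultdict


def broadcast_hash_join(small_rows, large_partitions, key):
    """Inner-join every partition of the large side against the small side."""
    # Build phase: materialize the small side once as a hash table on the join key.
    table = defaultdict(list)
    for row in small_rows:
        table[row[key]].append(row)

    joined_partitions = []
    for partition in large_partitions:
        # Stand-in for the copy of the table shipped to each mapper.
        local_table = dict(table)
        out = []
        # Probe phase: rows of the large side are matched locally, without a shuffle.
        for row in partition:
            for match in local_table.get(row[key], []):
                out.append({**match, **row})
        joined_partitions.append(out)
    return joined_partitions


# Hypothetical sample data: one small lookup table, one "large" table in two partitions.
small = [{"id": 1, "country": "DE"}, {"id": 2, "country": "FR"}]
large = [
    [{"id": 1, "amount": 10}, {"id": 3, "amount": 5}],
    [{"id": 2, "amount": 7}, {"id": 1, "amount": 2}],
]
result = broadcast_hash_join(small, large, "id")
```

Each partition is joined independently, which is why the large side never has to move across the network in this scheme.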

So if you are trying to perform smallDF.join(largeDF), keep in mind another constraint: the broadcast side also needs to fit completely into the memory of each executor. It also needs to fit into the memory of the driver!
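As a rough illustration of the size constraint, the sketch below decides whether a data set is small enough to broadcast. It is an assumption-laden toy: `should_broadcast` is a hypothetical helper, and the pickled size is only a stand-in for whatever size estimate the engine uses. The 10 MB figure mirrors the default of Spark's `spark.sql.autoBroadcastJoinThreshold` setting, where -1 disables automatic broadcasting.

```python
import pickle

# Mirrors the 10 MB default of spark.sql.autoBroadcastJoinThreshold.
DEFAULT_THRESHOLD_BYTES = 10 * 1024 * 1024


def should_broadcast(rows, threshold_bytes=DEFAULT_THRESHOLD_BYTES):
    """Return True if the serialized rows fit under the broadcast threshold."""
    # A negative threshold means "never broadcast", as with -1 in Spark.
    if threshold_bytes < 0:
        return False
    # Assumption: the pickled size approximates what would be shipped to each node.
    estimated_bytes = len(pickle.dumps(rows))
    return estimated_bytes <= threshold_bytes


# Hypothetical small lookup table.
small_rows = [{"id": i, "country": "DE"} for i in range(100)]
fits_default = should_broadcast(small_rows)
fits_tiny = should_broadcast(small_rows, threshold_bytes=16)
disabled = should_broadcast(small_rows, threshold_bytes=-1)
```

A check like this says nothing about whether the driver and every executor really have that much free memory, so treat it as a first filter only.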

Broadcast variables are shared among executors using the Torrent protocol, i.e. a peer-to-peer protocol. The advantage of the Torrent protocol is that peers share blocks of a file among each other instead of relying on a central entity holding all the blocks.
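The block-sharing idea can be sketched with a small simulation. This is not Spark's implementation: `distribute`, the node names, and the "fetch from the most recent holder" rule are assumptions chosen to keep the toy deterministic. It only shows that once one executor holds a block, later executors can get it from a peer rather than from the driver.

```python
def split_into_blocks(payload, block_size):
    """Cut the serialized payload into fixed-size blocks."""
    return [payload[i:i + block_size] for i in range(0, len(payload), block_size)]


def distribute(payload, executor_names, block_size=4):
    """Hand every executor all blocks, counting which node served each fetch."""
    blocks = split_into_blocks(payload, block_size)
    # At the start only the driver holds each block.
    holders = {i: ["driver"] for i in range(len(blocks))}
    served_by = {name: 0 for name in ["driver", *executor_names]}
    received = {}
    for name in executor_names:
        parts = []
        for i, block in enumerate(blocks):
            # Assumed policy: fetch from the node that obtained the block most recently.
            source = holders[i][-1]
            served_by[source] += 1
            parts.append(block)
            # The executor now holds the block and can serve it to later peers.
            holders[i].append(name)
        received[name] = b"".join(parts)
    return received, served_by


payload = b"broadcast-me!"
received, served_by = distribute(payload, ["e1", "e2", "e3"])
```

In this run the driver serves each block only once; the remaining fetches are answered by executors.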

The smallDF.join(largeDF) example above is enough to start experimenting with broadcast joins.

Note: A broadcast value cannot be modified after creation. If you try, the change will only be visible on one node.
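Why a change stays on one node can be shown with a toy model in which every node holds its own deserialized copy of the value. The `Node` class and the sample lookup table are hypothetical; `copy.deepcopy` stands in for the serialize-and-ship step.

```python
import copy


class Node:
    """A toy executor that keeps its own private copy of the broadcast value."""

    def __init__(self, name, broadcast_value):
        self.name = name
        # deepcopy stands in for serializing the value and deserializing it on the node.
        self.value = copy.deepcopy(broadcast_value)


lookup = {"DE": "Germany"}
nodes = [Node(f"executor-{i}", lookup) for i in range(3)]

# Mutating the copy on one node does not reach the other nodes or the original.
nodes[0].value["FR"] = "France"
visible_on = [node.name for node in nodes if "FR" in node.value]
```

After the mutation, only `executor-0` sees the new key, which is why broadcast values are best treated as read-only.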
