I have two DataFrames `A` and `B`:

`A` has columns `(id, info1, info2)` with about 200 million rows
If I understand your question correctly, you want to use a broadcast join that replicates DataFrame `B` on every node, so that the semi-join (i.e., using a join to filter `A` down to the `id` values present in `B`) can be computed independently on each node, instead of shuffling rows between nodes (i.e., a shuffle join).
You can run join functions that explicitly call for a broadcast join to achieve what you're trying to do:
import org.apache.spark.sql.functions.broadcast

// Broadcast B to every executor; keep only the rows of A whose id appears in B
val joinExpr = A.col("id") === B.col("id")
val filtered_A = A.join(broadcast(B), joinExpr, "left_semi")
You can run `filtered_A.explain()` and look for a `BroadcastHashJoin` node in the physical plan to verify that a broadcast join is being used.
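As a side note, Spark will also choose a broadcast join on its own when it estimates the smaller relation to fit under `spark.sql.autoBroadcastJoinThreshold` (10 MB by default); the explicit `broadcast()` hint simply forces the broadcast regardless of that estimate. A minimal sketch, assuming an active `SparkSession` named `spark`:

```scala
// Raise the auto-broadcast threshold to 50 MB (the default is 10 MB, in bytes).
// With the explicit broadcast() hint this setting is not required, but it
// controls when Spark broadcasts automatically without a hint.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 50 * 1024 * 1024)

// Setting it to -1 disables automatic broadcasting entirely.
// spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
```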