How to avoid shuffles while joining DataFrames on unique keys?

忘了有多久 2020-12-07 19:38

I have two DataFrames A and B:

  • A has columns (id, info1, info2) with about 200 million rows
  • B has an id column and is much smaller

3 Answers

  •  独厮守ぢ  2020-12-07 20:03

    If I understand your question correctly, you want a broadcast join: replicate DataFrame B to every node so that the semi-join (using B to filter ids out of DataFrame A) can be computed independently on each node, instead of shuffling data between nodes as a shuffle join would.

    You can call join with an explicit broadcast hint to achieve what you're trying to do:

    import org.apache.spark.sql.functions.broadcast

    // Join condition on the shared key column.
    val joinExpr = A.col("id") === B.col("id")

    // broadcast(B) hints Spark to ship B to every executor;
    // "left_semi" keeps only the rows of A whose id appears in B.
    val filtered_A = A.join(broadcast(B), joinExpr, "left_semi")


    You can run filtered_A.explain() and check that the physical plan contains a BroadcastHashJoin to verify that a broadcast join is being used.
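
    For completeness, below is a minimal, self-contained sketch of the same approach; the local SparkSession and the toy contents of A and B are made up purely for illustration.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.broadcast

    object BroadcastSemiJoinDemo {
      def main(args: Array[String]): Unit = {
        // Local session purely for demonstration.
        val spark = SparkSession.builder()
          .appName("broadcast-semi-join-demo")
          .master("local[*]")
          .getOrCreate()
        import spark.implicits._

        // Toy stand-ins for the real A (large) and B (small) DataFrames.
        val A = Seq((1L, "a", "x"), (2L, "b", "y"), (3L, "c", "z"))
          .toDF("id", "info1", "info2")
        val B = Seq(1L, 3L).toDF("id")

        val joinExpr = A.col("id") === B.col("id")
        val filtered_A = A.join(broadcast(B), joinExpr, "left_semi")

        // The physical plan should show a BroadcastHashJoin (LeftSemi)
        // rather than a shuffle-based SortMergeJoin.
        filtered_A.explain()
        filtered_A.show() // rows of A whose id also appears in B

        spark.stop()
      }
    }

    As a side note, Spark broadcasts the smaller side automatically when its estimated size is below spark.sql.autoBroadcastJoinThreshold (10 MB by default), so the explicit broadcast() hint matters most when that estimate is off, or when B exceeds the threshold but still fits comfortably in executor memory.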
