Question about joining dataframes in Spark


I guess I was able to figure out the reason for the different results in Python and Scala.

The reason is broadcast join optimisation. If spark-shell is started with broadcast joins disabled, both Python and Scala work identically.

./spark-shell --conf spark.sql.autoBroadcastJoinThreshold=-1

val df1 = Seq(
  (1, 1, 1)
).toDF("key1", "key2", "time").repartition(3, col("key1"), col("key2"))

val df2 = Seq(
  (1, 1, 1),
  (2, 2, 2)
).toDF("key1", "key2", "time").repartition(3, col("key1"), col("key2"))

val x = df1.join(df2, usingColumns = Seq("key1", "key2", "time"))

x.rdd.getNumPartitions == 200
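
The same thing can be checked without restarting the shell. Here is a minimal sketch (assuming the df1/df2 above and an active spark session) that disables broadcast joins via the runtime conf instead of the spark-shell flag:

// Runtime variant of the experiment above: disable broadcast joins
// through the session conf rather than a startup flag.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)

val y = df1.join(df2, usingColumns = Seq("key1", "key2", "time"))
y.explain()                       // SortMergeJoin, no BroadcastHashJoin
println(y.rdd.getNumPartitions)   // 200 (the default spark.sql.shuffle.partitions)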

So it looks like Spark 2.4.0 isn't able to optimise the described case out of the box, and a Catalyst optimizer extension is needed, as suggested by @user10938362.

BTW, here is some info about writing Catalyst optimizer extensions: https://developer.ibm.com/code/2017/11/30/learn-extension-points-apache-spark-extend-spark-catalyst-optimizer/
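
For illustration, a minimal sketch of wiring a custom rule in through those extension points (available since Spark 2.2). MyJoinRule is a hypothetical no-op placeholder; a real rule would contain the logic that removes the redundant shuffle described above:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Hypothetical no-op rule; real logic would rewrite the plan here.
object MyJoinRule extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan
}

// Register the rule when building the session (in an application,
// before any session exists; an already-running shell session won't pick it up).
val spark = SparkSession.builder()
  .withExtensions(_.injectOptimizerRule(_ => MyJoinRule))
  .getOrCreate()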

The behaviour of the Catalyst Optimizer differs between pyspark and Scala (on Spark 2.4 at least).

I ran both and got two different plans.

Indeed, you get 200 partitions in pyspark unless you set the shuffle partition count explicitly for pyspark:

 spark.conf.set("spark.sql.shuffle.partitions", 3)

Then 3 partitions are processed, and thus 3 are retained under pyspark.
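
For comparison, a sketch of the Scala side (assuming the df1/df2 from the snippet above): with broadcast joins left at their default threshold, the small side is broadcast, no shuffle happens, and the join keeps the 3 input partitions without touching spark.sql.shuffle.partitions.

// Scala with the default broadcast threshold: the planner broadcasts the
// small side, so the join preserves df1's 3 partitions.
val z = df1.join(df2, usingColumns = Seq("key1", "key2", "time"))
z.explain()                      // BroadcastHashJoin expected
println(z.rdd.getNumPartitions)  // 3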

I was a little surprised, as I would have thought the behaviour would be common under the hood. So people keep telling me; it just goes to show.

Physical plan for pyspark with the parameter set via conf:

== Physical Plan ==
*(5) Project [key1#344L, key2#345L, time#346L]
+- SortMergeJoin [key1#344L, key2#345L, time#346L], [key1#350L, key2#351L, time#352L], LeftOuter
   :- *(2) Sort [key1#344L ASC NULLS FIRST, key2#345L ASC NULLS FIRST, time#346L ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(key1#344L, key2#345L, time#346L, 3)
   :     +- *(1) Scan ExistingRDD[key1#344L,key2#345L,time#346L]
   +- *(4) Sort [key1#350L ASC NULLS FIRST, key2#351L ASC NULLS FIRST, time#352L ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(key1#350L, key2#351L, time#352L, 3)
         +- *(3) Filter ((isnotnull(key1#350L) && isnotnull(key2#351L)) && isnotnull(time#352L))
            +- *(3) Scan ExistingRDD[key1#350L,key2#351L,time#352L]