Spark cross join: two similar pieces of code, one works, one does not

Submitted by 天大地大妈咪最大 on 2019-12-20 07:07:57

Question


I have the following code:

import org.apache.spark.sql.functions.lit
import spark.implicits._  // enables toDF and the $"col" syntax

// First pair: DataFrames built from local Seq data
val ori0 = Seq(
  (0L, "1")
).toDF("id", "col1")
val date0 = Seq(
  (0L, "1")
).toDF("id", "date")

val joinExpression = $"col1" === $"date"
ori0.join(date0, joinExpression).show()

// Second pair: DataFrames built from ranges with literal columns
val ori = spark.range(1).withColumn("col1", lit("1"))
val date = spark.range(1).withColumn("date", lit("1"))
ori.join(date, joinExpression).show()

The first join works, but the second one throws an error:

Exception in thread "main" org.apache.spark.sql.AnalysisException: Detected implicit cartesian product for INNER join between logical plans
Range (0, 1, step=1, splits=Some(4))
and
Project [_1#11L AS id#14L, _2#12 AS date#15]
+- Filter (isnotnull(_2#12) && (1 = _2#12))
   +- LocalRelation [_1#11L, _2#12]
Join condition is missing or trivial.

I have looked at this many times, and I still do not understand why it is a cross join, or what the difference between the two is.


Answer 1:


If you were to expand the second join, you'd see that it is really equivalent to:

SELECT * 
FROM ori JOIN date
WHERE 1 = 1

Clearly the WHERE 1 = 1 join condition is trivial, which is one of the conditions under which Spark detects a Cartesian product.
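
If the Cartesian product is actually what you want, you can tell Spark so explicitly. A minimal sketch of the two standard ways (crossJoin has existed since Spark 2.1; the spark.sql.crossJoin.enabled setting defaults to true from Spark 3.0 onward):

// Option 1: request the Cartesian product explicitly
ori.crossJoin(date).show()

// Option 2: allow implicit cross joins session-wide; the trivial condition
// then folds to `true` and the join succeeds as a full Cartesian product
spark.conf.set("spark.sql.crossJoin.enabled", "true")
ori.join(date, joinExpression).show()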

This does not happen in the first case: there the join keys come from the row data rather than from literals, so the optimizer cannot infer at that point that the join columns contain only a single value, and it will attempt to apply a hash or sort-merge join instead.
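
To illustrate, here is a minimal sketch (the ori2/date2 DataFrames are hypothetical, introduced only for this example): if the range-based key is computed from the data instead of a literal, the optimizer can no longer prove the condition trivial and the join goes through without the exception.

// Key derived from `id` (data, not a literal): constant folding cannot
// reduce the condition, so Spark plans an ordinary equi-join.
val ori2  = spark.range(1).withColumn("col1", $"id".cast("string"))
val date2 = spark.range(1).withColumn("date", $"id".cast("string"))
ori2.join(date2, $"col1" === $"date").show()  // no AnalysisException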



Source: https://stackoverflow.com/questions/52552868/spark-cross-join-two-similar-code-one-works-one-not
