Spark join produces wrong results

甜味超标 2020-12-19 15:29

Presenting here before possibly filing a bug. I'm using Spark 1.6.0.

This is a simplified version of the problem I'm dealing with. I've filtered a table, and the

1 Answer
    野趣味 (OP)
    2020-12-19 16:18

    If you want the expected behavior, either join on column names:

    val b = Seq(("a", "b", 1), ("a", "b", 2)).toDF("a", "b", "c")
    val a = b.where("c = 1")
    
    a.join(b, Seq("a", "b", "c")).show
    // +---+---+---+
    // |  a|  b|  c|
    // +---+---+---+
    // |  a|  b|  1|
    // +---+---+---+
    

    or use aliases:

    val aa = a.alias("a")
    val bb = b.alias("b")
    
    aa.join(bb, $"a.a" === $"b.a" && $"a.b" === $"b.b" && $"a.c" === $"b.c")
    

    You can use the null-safe equality operator <=> as well:

    aa.join(bb, $"a.a" <=> $"b.a" && $"a.b" <=> $"b.b" && $"a.c" <=> $"b.c")
    

    As far as I remember, simple equality in a self-join condition has been special-cased for a while. That's why you get correct results despite the warning.
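    As a sketch of that special case (assuming the same spark-shell session and the a and b defined above; the exact warning text is paraphrased from memory): referencing columns through the parent DataFrames makes both sides of each === initially resolve to the same attribute, and Spark logs a warning about a trivially true predicate, but it still disambiguates the self-join, so the result comes out correct:

    // a derives from b, so a("c") and b("c") first resolve to the same column;
    // Spark warns but re-aliases one side, producing a genuine equi-join.
    a.join(b, a("a") === b("a") && a("b") === b("b") && a("c") === b("c")).show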

    The second behavior does indeed look like a bug, related to the fact that you still have a.c in your data: it appears to be resolved downstream before b.c, so the condition actually evaluated is a.newc = a.c.

    val expr = $"filta" === $"a" and $"filtb" === $"b" and $"newc" === $"c"
    a.withColumnRenamed("c", "newc").join(b, expr, "left_outer")
    
