Presenting here before possibly filing a bug. I'm using Spark 1.6.0.
This is a simplified version of the problem I'm dealing with. I've filtered a table, and the
If you want the expected behavior, use either a join on the column names:
val b = Seq(("a", "b", 1), ("a", "b", 2)).toDF("a", "b", "c")
val a = b.where("c = 1")
a.join(b, Seq("a", "b", "c")).show
// +---+---+---+
// | a| b| c|
// +---+---+---+
// | a| b| 1|
// +---+---+---+
or aliases:
val aa = a.alias("a")
val bb = b.alias("b")
aa.join(bb, $"a.a" === $"b.a" && $"a.b" === $"b.b" && $"a.c" === $"b.c")
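The alias join keeps both copies of each column, so you typically project one side afterwards. A sketch building on the aa and bb aliases above:

```scala
// Select only the "a"-side columns after the alias join,
// so the result has a single copy of a, b, and c.
aa.join(bb, $"a.a" === $"b.a" && $"a.b" === $"b.b" && $"a.c" === $"b.c")
  .select($"a.a", $"a.b", $"a.c")
```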
You can use the null-safe equality operator <=> as well:
aa.join(bb, $"a.a" <=> $"b.a" && $"a.b" <=> $"b.b" && $"a.c" <=> $"b.c")
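For context, === and <=> differ only when nulls are involved: === evaluates to null if either side is null (so the row is dropped), while <=> treats two nulls as equal. A minimal sketch, assuming a spark-shell session with the implicits in scope as in the snippets above; the frames and column names here are made up for illustration:

```scala
// Toy frames with a nullable join key.
val left  = Seq((Some(1), "x"), (None, "y")).toDF("k", "v")
val right = Seq((Some(1), "p"), (None, "q")).toDF("k", "w")

// === never matches null keys, so only the k = 1 rows join.
left.join(right, left("k") === right("k")).count  // 1

// <=> also matches the null keys against each other.
left.join(right, left("k") <=> right("k")).count  // 2
```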
As far as I remember there's been a special case for simple equality for a while. That's why you get correct results despite the warning.
The second behavior does indeed look like a bug, related to the fact that you still have a.c in your data. It looks like it is picked downstream before b.c, and the condition that is actually evaluated is a.newc = a.c:
val expr = $"filta" === $"a" and $"filtb" === $"b" and $"newc" === $"c"
a.withColumnRenamed("c", "newc").join(b, expr, "left_outer")