Cleanest, most efficient syntax to perform DataFrame self-join in Spark

Asked 2020-12-08 19:05 · 1 answer
In standard SQL, when you join a table to itself, you can create aliases for the tables to keep track of which columns you are referring to:

SELECT a.column_
1 Answer
  • 2020-12-08 19:41

    There are at least two different ways you can approach this: either by aliasing:

    df.as("df1").join(df.as("df2"), $"df1.foo" === $"df2.foo")
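    A fuller, self-contained sketch of the alias approach (assuming a local SparkSession and a toy dataset with hypothetical columns `foo` and `value`), including how to disambiguate the duplicated columns in the result:

    ```scala
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("self-join-alias")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical toy data: (foo, value)
    val df = Seq(("a", 1), ("a", 2), ("b", 3)).toDF("foo", "value")

    // Alias both sides so each column can be referenced unambiguously,
    // exactly like `a` and `b` aliases in SQL.
    val joined = df.as("df1")
      .join(df.as("df2"), $"df1.foo" === $"df2.foo")
      .select(
        $"df1.foo",
        $"df1.value".as("left_value"),
        $"df2.value".as("right_value"))

    joined.show()
    ```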
    

    or using name-based equality joins:

    // Note that it will result in ambiguous column names
    // so using aliases here could be a good idea as well.
    // df.as("df1").join(df.as("df2"), Seq("foo"))
    
    df.join(df, Seq("foo"))  
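    With the name-based (USING-style) form, Spark keeps a single copy of the join key in the output, but the remaining non-key columns still collide by name. A sketch with the same kind of hypothetical data:

    ```scala
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("self-join-using")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(("a", 1), ("b", 2)).toDF("foo", "value")

    // Seq("foo") deduplicates the join key: `foo` appears once in the output,
    // but both `value` columns remain and are ambiguous by name.
    val joined = df.join(df, Seq("foo"))
    joined.printSchema()
    ```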
    

    In general, column renaming, while the ugliest, is the safest practice across all the versions. There have been a few bugs related to column resolution (we found one on SO not so long ago), and some details may differ between parsers (HiveContext / standard SQLContext) if you use raw expressions.
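    The renaming approach could look like the sketch below (the `r_` prefix is illustrative): every column on one side is renamed up front, so no resolution ambiguity can arise regardless of version or parser.

    ```scala
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("self-join-rename")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(("a", 1), ("b", 2)).toDF("foo", "value")

    // Prefix every column on the right side; after this there are no
    // duplicate names anywhere, so plain column references are safe.
    val right = df.columns.foldLeft(df)((d, c) => d.withColumnRenamed(c, s"r_$c"))
    val joined = df.join(right, $"foo" === $"r_foo")
    ```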

    Personally, I prefer using aliases because of their resemblance to idiomatic SQL and the ability to use them outside the scope of a specific DataFrame object.

    Regarding performance: unless you're interested in close-to-real-time processing, there should be no performance difference whatsoever. All of these should generate the same execution plan.
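    One way to check this claim yourself is to print the physical plans of both variants with `explain()` and compare the join operators Catalyst produces (a sketch, using the same hypothetical `foo`/`value` data):

    ```scala
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("self-join-plans")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(("a", 1), ("b", 2)).toDF("foo", "value")

    // Both variants should optimize down to the same kind of join;
    // explain() prints the physical plan for inspection.
    df.as("df1").join(df.as("df2"), $"df1.foo" === $"df2.foo").explain()
    df.join(df, Seq("foo")).explain()
    ```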
