Joining Spark dataframes on the key

情书的邮戳 2020-11-28 03:02

I have constructed two dataframes. How can we join multiple Spark dataframes?

For example:

PersonDf and ProfileDf, which share a common column.

8 answers
  •  忘掉有多难
    2020-11-28 03:28

    In addition to my answer above, I have tried to demonstrate all the Spark joins with the same case classes, using Spark 2.x. My LinkedIn article has full examples and explanations.

    All join types: the default is inner. The join type must be one of: inner, cross, outer, full, full_outer, left, left_outer, right, right_outer, left_semi, left_anti.

    import org.apache.spark.sql._
    import org.apache.spark.sql.functions._
    
    
     /**
      * @author : Ram Ghadiyaram
      */
    object SparkJoinTypesDemo extends App {
      private[this] implicit val spark = SparkSession.builder().master("local[*]").getOrCreate()
      spark.sparkContext.setLogLevel("ERROR")
      case class Person(name: String, age: Int, personid: Int)
      case class Profile(profileName: String, personid: Int, profileDescription: String)
      /**
        * Join types accepted by `join`. Default `inner`. Must be one of:
        * `inner`, `cross`, `outer`, `full`, `full_outer`, `left`, `left_outer`,
        * `right`, `right_outer`, `left_semi`, `left_anti`.
        */
      val joinTypes = Seq(
        "inner"
        , "outer"
        , "full"
        , "full_outer"
        , "left"
        , "left_outer"
        , "right"
        , "right_outer"
        , "left_semi"
        , "left_anti"
        //, "cross"
      )
      val df1 = spark.createDataFrame(
        Person("Nataraj", 45, 2)
          :: Person("Srinivas", 45, 5)
          :: Person("Ashik", 22, 9)
          :: Person("Deekshita", 22, 8)
          :: Person("Siddhika", 22, 4)
          :: Person("Madhu", 22, 3)
          :: Person("Meghna", 22, 2)
          :: Person("Snigdha", 22, 2)
          :: Person("Harshita", 22, 6)
          :: Person("Ravi", 42, 0)
          :: Person("Ram", 42, 9)
          :: Person("Chidananda Raju", 35, 9)
          :: Person("Sreekanth Doddy", 29, 9)
          :: Nil)
      val df2 = spark.createDataFrame(
        Profile("Spark", 2, "SparkSQLMaster")
          :: Profile("Spark", 5, "SparkGuru")
          :: Profile("Spark", 9, "DevHunter")
          :: Profile("Spark", 3, "Evangelist")
          :: Profile("Spark", 0, "Committer")
          :: Profile("Spark", 1, "All Rounder")
          :: Nil
      )
      val df_asPerson = df1.as("dfperson")
      val df_asProfile = df2.as("dfprofile")
      val joined_df = df_asPerson.join(
        df_asProfile
        , col("dfperson.personid") === col("dfprofile.personid")
        , "inner")
    
      println("First example inner join  ")
    
    
      // aliases let you refer to columns by a qualified name, which improves readability
      joined_df.select(
        col("dfperson.name")
        , col("dfperson.age")
        , col("dfprofile.profileName")
        , col("dfprofile.profileDescription"))
        .show
      println("all joins in a loop")
      joinTypes foreach { joinType =>
        println(s"${joinType.toUpperCase()} JOIN")
        df_asPerson.join(right = df_asProfile, usingColumns = Seq("personid"), joinType = joinType)
          .orderBy("personid")
          .show()
      }
      println(
        """
          |Till 1.x  cross join is :  df_asPerson.join(df_asProfile)
          |
          | Explicit Cross Join in 2.x :
          | http://blog.madhukaraphatak.com/migrating-to-spark-two-part-4/
          | Cartesian joins are very expensive without an extra filter that can be pushed down.
          |
          | cross join or cartesian product
          |
          |
        """.stripMargin)
    
      val crossJoinDf = df_asPerson.crossJoin(right = df_asProfile)
      crossJoinDf.show(200, false)
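      // explain() prints the physical plan itself and returns Unit,
      // so wrapping it in println additionally prints "()"
      println(crossJoinDf.explain())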
      println(crossJoinDf.count)
    
      println("createOrReplaceTempView example ")
      println(
        """
          |Creates a local temporary view using the given name. The lifetime of this
          |   temporary view is tied to the [[SparkSession]] that was used to create this Dataset.
        """.stripMargin)
    
    
    
    
      df_asPerson.createOrReplaceTempView("dfperson")
      df_asProfile.createOrReplaceTempView("dfprofile")
      val sql =
        s"""
           |SELECT dfperson.name
           |, dfperson.age
           |, dfprofile.profileDescription
           |  FROM  dfperson JOIN  dfprofile
           | ON dfperson.personid == dfprofile.personid
        """.stripMargin
      println(s"createOrReplaceTempView  sql $sql")
      val sqldf = spark.sql(sql)
      sqldf.show
    
    
      println(
        """
          |
          |**** EXCEPT DEMO ***
          |
      """.stripMargin)
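      // except is a set difference over whole rows (matched by column position), not a comparison on personid
      println(" df_asPerson.except(df_asProfile) Except demo")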
      df_asPerson.except(df_asProfile).show
    
    
      println(" df_asProfile.except(df_asPerson) Except demo")
      df_asProfile.except(df_asPerson).show
    }
    

    Result:

    First example inner join  
    +---------------+---+-----------+------------------+
    |           name|age|profileName|profileDescription|
    +---------------+---+-----------+------------------+
    |        Nataraj| 45|      Spark|    SparkSQLMaster|
    |       Srinivas| 45|      Spark|         SparkGuru|
    |          Ashik| 22|      Spark|         DevHunter|
    |          Madhu| 22|      Spark|        Evangelist|
    |         Meghna| 22|      Spark|    SparkSQLMaster|
    |        Snigdha| 22|      Spark|    SparkSQLMaster|
    |           Ravi| 42|      Spark|         Committer|
    |            Ram| 42|      Spark|         DevHunter|
    |Chidananda Raju| 35|      Spark|         DevHunter|
    |Sreekanth Doddy| 29|      Spark|         DevHunter|
    +---------------+---+-----------+------------------+
    
    all joins in a loop
    INNER JOIN
    +--------+---------------+---+-----------+------------------+
    |personid|           name|age|profileName|profileDescription|
    +--------+---------------+---+-----------+------------------+
    |       0|           Ravi| 42|      Spark|         Committer|
    |       2|        Snigdha| 22|      Spark|    SparkSQLMaster|
    |       2|         Meghna| 22|      Spark|    SparkSQLMaster|
    |       2|        Nataraj| 45|      Spark|    SparkSQLMaster|
    |       3|          Madhu| 22|      Spark|        Evangelist|
    |       5|       Srinivas| 45|      Spark|         SparkGuru|
    |       9|            Ram| 42|      Spark|         DevHunter|
    |       9|          Ashik| 22|      Spark|         DevHunter|
    |       9|Chidananda Raju| 35|      Spark|         DevHunter|
    |       9|Sreekanth Doddy| 29|      Spark|         DevHunter|
    +--------+---------------+---+-----------+------------------+
    
    OUTER JOIN
    +--------+---------------+----+-----------+------------------+
    |personid|           name| age|profileName|profileDescription|
    +--------+---------------+----+-----------+------------------+
    |       0|           Ravi|  42|      Spark|         Committer|
    |       1|           null|null|      Spark|       All Rounder|
    |       2|        Nataraj|  45|      Spark|    SparkSQLMaster|
    |       2|        Snigdha|  22|      Spark|    SparkSQLMaster|
    |       2|         Meghna|  22|      Spark|    SparkSQLMaster|
    |       3|          Madhu|  22|      Spark|        Evangelist|
    |       4|       Siddhika|  22|       null|              null|
    |       5|       Srinivas|  45|      Spark|         SparkGuru|
    |       6|       Harshita|  22|       null|              null|
    |       8|      Deekshita|  22|       null|              null|
    |       9|          Ashik|  22|      Spark|         DevHunter|
    |       9|            Ram|  42|      Spark|         DevHunter|
    |       9|Chidananda Raju|  35|      Spark|         DevHunter|
    |       9|Sreekanth Doddy|  29|      Spark|         DevHunter|
    +--------+---------------+----+-----------+------------------+
    
    FULL JOIN
    +--------+---------------+----+-----------+------------------+
    |personid|           name| age|profileName|profileDescription|
    +--------+---------------+----+-----------+------------------+
    |       0|           Ravi|  42|      Spark|         Committer|
    |       1|           null|null|      Spark|       All Rounder|
    |       2|        Nataraj|  45|      Spark|    SparkSQLMaster|
    |       2|         Meghna|  22|      Spark|    SparkSQLMaster|
    |       2|        Snigdha|  22|      Spark|    SparkSQLMaster|
    |       3|          Madhu|  22|      Spark|        Evangelist|
    |       4|       Siddhika|  22|       null|              null|
    |       5|       Srinivas|  45|      Spark|         SparkGuru|
    |       6|       Harshita|  22|       null|              null|
    |       8|      Deekshita|  22|       null|              null|
    |       9|          Ashik|  22|      Spark|         DevHunter|
    |       9|            Ram|  42|      Spark|         DevHunter|
    |       9|Sreekanth Doddy|  29|      Spark|         DevHunter|
    |       9|Chidananda Raju|  35|      Spark|         DevHunter|
    +--------+---------------+----+-----------+------------------+
    
    FULL_OUTER JOIN
    +--------+---------------+----+-----------+------------------+
    |personid|           name| age|profileName|profileDescription|
    +--------+---------------+----+-----------+------------------+
    |       0|           Ravi|  42|      Spark|         Committer|
    |       1|           null|null|      Spark|       All Rounder|
    |       2|        Nataraj|  45|      Spark|    SparkSQLMaster|
    |       2|         Meghna|  22|      Spark|    SparkSQLMaster|
    |       2|        Snigdha|  22|      Spark|    SparkSQLMaster|
    |       3|          Madhu|  22|      Spark|        Evangelist|
    |       4|       Siddhika|  22|       null|              null|
    |       5|       Srinivas|  45|      Spark|         SparkGuru|
    |       6|       Harshita|  22|       null|              null|
    |       8|      Deekshita|  22|       null|              null|
    |       9|          Ashik|  22|      Spark|         DevHunter|
    |       9|            Ram|  42|      Spark|         DevHunter|
    |       9|Chidananda Raju|  35|      Spark|         DevHunter|
    |       9|Sreekanth Doddy|  29|      Spark|         DevHunter|
    +--------+---------------+----+-----------+------------------+
    
    LEFT JOIN
    +--------+---------------+---+-----------+------------------+
    |personid|           name|age|profileName|profileDescription|
    +--------+---------------+---+-----------+------------------+
    |       0|           Ravi| 42|      Spark|         Committer|
    |       2|        Snigdha| 22|      Spark|    SparkSQLMaster|
    |       2|         Meghna| 22|      Spark|    SparkSQLMaster|
    |       2|        Nataraj| 45|      Spark|    SparkSQLMaster|
    |       3|          Madhu| 22|      Spark|        Evangelist|
    |       4|       Siddhika| 22|       null|              null|
    |       5|       Srinivas| 45|      Spark|         SparkGuru|
    |       6|       Harshita| 22|       null|              null|
    |       8|      Deekshita| 22|       null|              null|
    |       9|            Ram| 42|      Spark|         DevHunter|
    |       9|          Ashik| 22|      Spark|         DevHunter|
    |       9|Chidananda Raju| 35|      Spark|         DevHunter|
    |       9|Sreekanth Doddy| 29|      Spark|         DevHunter|
    +--------+---------------+---+-----------+------------------+
    
    LEFT_OUTER JOIN
    +--------+---------------+---+-----------+------------------+
    |personid|           name|age|profileName|profileDescription|
    +--------+---------------+---+-----------+------------------+
    |       0|           Ravi| 42|      Spark|         Committer|
    |       2|        Nataraj| 45|      Spark|    SparkSQLMaster|
    |       2|         Meghna| 22|      Spark|    SparkSQLMaster|
    |       2|        Snigdha| 22|      Spark|    SparkSQLMaster|
    |       3|          Madhu| 22|      Spark|        Evangelist|
    |       4|       Siddhika| 22|       null|              null|
    |       5|       Srinivas| 45|      Spark|         SparkGuru|
    |       6|       Harshita| 22|       null|              null|
    |       8|      Deekshita| 22|       null|              null|
    |       9|Chidananda Raju| 35|      Spark|         DevHunter|
    |       9|Sreekanth Doddy| 29|      Spark|         DevHunter|
    |       9|          Ashik| 22|      Spark|         DevHunter|
    |       9|            Ram| 42|      Spark|         DevHunter|
    +--------+---------------+---+-----------+------------------+
    
    RIGHT JOIN
    +--------+---------------+----+-----------+------------------+
    |personid|           name| age|profileName|profileDescription|
    +--------+---------------+----+-----------+------------------+
    |       0|           Ravi|  42|      Spark|         Committer|
    |       1|           null|null|      Spark|       All Rounder|
    |       2|        Snigdha|  22|      Spark|    SparkSQLMaster|
    |       2|         Meghna|  22|      Spark|    SparkSQLMaster|
    |       2|        Nataraj|  45|      Spark|    SparkSQLMaster|
    |       3|          Madhu|  22|      Spark|        Evangelist|
    |       5|       Srinivas|  45|      Spark|         SparkGuru|
    |       9|Sreekanth Doddy|  29|      Spark|         DevHunter|
    |       9|Chidananda Raju|  35|      Spark|         DevHunter|
    |       9|            Ram|  42|      Spark|         DevHunter|
    |       9|          Ashik|  22|      Spark|         DevHunter|
    +--------+---------------+----+-----------+------------------+
    
    RIGHT_OUTER JOIN
    +--------+---------------+----+-----------+------------------+
    |personid|           name| age|profileName|profileDescription|
    +--------+---------------+----+-----------+------------------+
    |       0|           Ravi|  42|      Spark|         Committer|
    |       1|           null|null|      Spark|       All Rounder|
    |       2|         Meghna|  22|      Spark|    SparkSQLMaster|
    |       2|        Snigdha|  22|      Spark|    SparkSQLMaster|
    |       2|        Nataraj|  45|      Spark|    SparkSQLMaster|
    |       3|          Madhu|  22|      Spark|        Evangelist|
    |       5|       Srinivas|  45|      Spark|         SparkGuru|
    |       9|Sreekanth Doddy|  29|      Spark|         DevHunter|
    |       9|          Ashik|  22|      Spark|         DevHunter|
    |       9|Chidananda Raju|  35|      Spark|         DevHunter|
    |       9|            Ram|  42|      Spark|         DevHunter|
    +--------+---------------+----+-----------+------------------+
    
    LEFT_SEMI JOIN
    +--------+---------------+---+
    |personid|           name|age|
    +--------+---------------+---+
    |       0|           Ravi| 42|
    |       2|        Nataraj| 45|
    |       2|         Meghna| 22|
    |       2|        Snigdha| 22|
    |       3|          Madhu| 22|
    |       5|       Srinivas| 45|
    |       9|Chidananda Raju| 35|
    |       9|Sreekanth Doddy| 29|
    |       9|            Ram| 42|
    |       9|          Ashik| 22|
    +--------+---------------+---+
    
    LEFT_ANTI JOIN
    +--------+---------+---+
    |personid|     name|age|
    +--------+---------+---+
    |       4| Siddhika| 22|
    |       6| Harshita| 22|
    |       8|Deekshita| 22|
    +--------+---------+---+
    
    
    Till 1.x  Cross join is :  `df_asPerson.join(df_asProfile)`
    
     Explicit Cross Join in 2.x :
     http://blog.madhukaraphatak.com/migrating-to-spark-two-part-4/
     Cartesian joins are very expensive without an extra filter that can be pushed down.
    
     Cross join or Cartesian product
    
    
    
    +---------------+---+--------+-----------+--------+------------------+
    |name           |age|personid|profileName|personid|profileDescription|
    +---------------+---+--------+-----------+--------+------------------+
    |Nataraj        |45 |2       |Spark      |2       |SparkSQLMaster    |
    |Nataraj        |45 |2       |Spark      |5       |SparkGuru         |
    |Nataraj        |45 |2       |Spark      |9       |DevHunter         |
    |Nataraj        |45 |2       |Spark      |3       |Evangelist        |
    |Nataraj        |45 |2       |Spark      |0       |Committer         |
    |Nataraj        |45 |2       |Spark      |1       |All Rounder       |
    |Srinivas       |45 |5       |Spark      |2       |SparkSQLMaster    |
    |Srinivas       |45 |5       |Spark      |5       |SparkGuru         |
    |Srinivas       |45 |5       |Spark      |9       |DevHunter         |
    |Srinivas       |45 |5       |Spark      |3       |Evangelist        |
    |Srinivas       |45 |5       |Spark      |0       |Committer         |
    |Srinivas       |45 |5       |Spark      |1       |All Rounder       |
    |Ashik          |22 |9       |Spark      |2       |SparkSQLMaster    |
    |Ashik          |22 |9       |Spark      |5       |SparkGuru         |
    |Ashik          |22 |9       |Spark      |9       |DevHunter         |
    |Ashik          |22 |9       |Spark      |3       |Evangelist        |
    |Ashik          |22 |9       |Spark      |0       |Committer         |
    |Ashik          |22 |9       |Spark      |1       |All Rounder       |
    |Deekshita      |22 |8       |Spark      |2       |SparkSQLMaster    |
    |Deekshita      |22 |8       |Spark      |5       |SparkGuru         |
    |Deekshita      |22 |8       |Spark      |9       |DevHunter         |
    |Deekshita      |22 |8       |Spark      |3       |Evangelist        |
    |Deekshita      |22 |8       |Spark      |0       |Committer         |
    |Deekshita      |22 |8       |Spark      |1       |All Rounder       |
    |Siddhika       |22 |4       |Spark      |2       |SparkSQLMaster    |
    |Siddhika       |22 |4       |Spark      |5       |SparkGuru         |
    |Siddhika       |22 |4       |Spark      |9       |DevHunter         |
    |Siddhika       |22 |4       |Spark      |3       |Evangelist        |
    |Siddhika       |22 |4       |Spark      |0       |Committer         |
    |Siddhika       |22 |4       |Spark      |1       |All Rounder       |
    |Madhu          |22 |3       |Spark      |2       |SparkSQLMaster    |
    |Madhu          |22 |3       |Spark      |5       |SparkGuru         |
    |Madhu          |22 |3       |Spark      |9       |DevHunter         |
    |Madhu          |22 |3       |Spark      |3       |Evangelist        |
    |Madhu          |22 |3       |Spark      |0       |Committer         |
    |Madhu          |22 |3       |Spark      |1       |All Rounder       |
    |Meghna         |22 |2       |Spark      |2       |SparkSQLMaster    |
    |Meghna         |22 |2       |Spark      |5       |SparkGuru         |
    |Meghna         |22 |2       |Spark      |9       |DevHunter         |
    |Meghna         |22 |2       |Spark      |3       |Evangelist        |
    |Meghna         |22 |2       |Spark      |0       |Committer         |
    |Meghna         |22 |2       |Spark      |1       |All Rounder       |
    |Snigdha        |22 |2       |Spark      |2       |SparkSQLMaster    |
    |Snigdha        |22 |2       |Spark      |5       |SparkGuru         |
    |Snigdha        |22 |2       |Spark      |9       |DevHunter         |
    |Snigdha        |22 |2       |Spark      |3       |Evangelist        |
    |Snigdha        |22 |2       |Spark      |0       |Committer         |
    |Snigdha        |22 |2       |Spark      |1       |All Rounder       |
    |Harshita       |22 |6       |Spark      |2       |SparkSQLMaster    |
    |Harshita       |22 |6       |Spark      |5       |SparkGuru         |
    |Harshita       |22 |6       |Spark      |9       |DevHunter         |
    |Harshita       |22 |6       |Spark      |3       |Evangelist        |
    |Harshita       |22 |6       |Spark      |0       |Committer         |
    |Harshita       |22 |6       |Spark      |1       |All Rounder       |
    |Ravi           |42 |0       |Spark      |2       |SparkSQLMaster    |
    |Ravi           |42 |0       |Spark      |5       |SparkGuru         |
    |Ravi           |42 |0       |Spark      |9       |DevHunter         |
    |Ravi           |42 |0       |Spark      |3       |Evangelist        |
    |Ravi           |42 |0       |Spark      |0       |Committer         |
    |Ravi           |42 |0       |Spark      |1       |All Rounder       |
    |Ram            |42 |9       |Spark      |2       |SparkSQLMaster    |
    |Ram            |42 |9       |Spark      |5       |SparkGuru         |
    |Ram            |42 |9       |Spark      |9       |DevHunter         |
    |Ram            |42 |9       |Spark      |3       |Evangelist        |
    |Ram            |42 |9       |Spark      |0       |Committer         |
    |Ram            |42 |9       |Spark      |1       |All Rounder       |
    |Chidananda Raju|35 |9       |Spark      |2       |SparkSQLMaster    |
    |Chidananda Raju|35 |9       |Spark      |5       |SparkGuru         |
    |Chidananda Raju|35 |9       |Spark      |9       |DevHunter         |
    |Chidananda Raju|35 |9       |Spark      |3       |Evangelist        |
    |Chidananda Raju|35 |9       |Spark      |0       |Committer         |
    |Chidananda Raju|35 |9       |Spark      |1       |All Rounder       |
    |Sreekanth Doddy|29 |9       |Spark      |2       |SparkSQLMaster    |
    |Sreekanth Doddy|29 |9       |Spark      |5       |SparkGuru         |
    |Sreekanth Doddy|29 |9       |Spark      |9       |DevHunter         |
    |Sreekanth Doddy|29 |9       |Spark      |3       |Evangelist        |
    |Sreekanth Doddy|29 |9       |Spark      |0       |Committer         |
    |Sreekanth Doddy|29 |9       |Spark      |1       |All Rounder       |
    +---------------+---+--------+-----------+--------+------------------+
    
    == Physical Plan ==
    BroadcastNestedLoopJoin BuildRight, Cross
    :- LocalTableScan [name#0, age#1, personid#2]
    +- BroadcastExchange IdentityBroadcastMode
       +- LocalTableScan [profileName#7, personid#8, profileDescription#9]
    ()
    78
    createOrReplaceTempView example 
    
    Creates a local temporary view using the given name. The lifetime of this
       temporary view is tied to the [[SparkSession]] that was used to create this Dataset.
    
    createOrReplaceTempView  sql 
    SELECT dfperson.name
    , dfperson.age
    , dfprofile.profileDescription
      FROM  dfperson JOIN  dfprofile
     ON dfperson.personid == dfprofile.personid
    
    +---------------+---+------------------+
    |           name|age|profileDescription|
    +---------------+---+------------------+
    |        Nataraj| 45|    SparkSQLMaster|
    |       Srinivas| 45|         SparkGuru|
    |          Ashik| 22|         DevHunter|
    |          Madhu| 22|        Evangelist|
    |         Meghna| 22|    SparkSQLMaster|
    |        Snigdha| 22|    SparkSQLMaster|
    |           Ravi| 42|         Committer|
    |            Ram| 42|         DevHunter|
    |Chidananda Raju| 35|         DevHunter|
    |Sreekanth Doddy| 29|         DevHunter|
    +---------------+---+------------------+
    
    
    
    **** EXCEPT DEMO ***
    
    
     df_asPerson.except(df_asProfile) Except demo
    +---------------+---+--------+
    |           name|age|personid|
    +---------------+---+--------+
    |          Ashik| 22|       9|
    |       Harshita| 22|       6|
    |          Madhu| 22|       3|
    |            Ram| 42|       9|
    |           Ravi| 42|       0|
    |Chidananda Raju| 35|       9|
    |       Siddhika| 22|       4|
    |       Srinivas| 45|       5|
    |Sreekanth Doddy| 29|       9|
    |      Deekshita| 22|       8|
    |         Meghna| 22|       2|
    |        Snigdha| 22|       2|
    |        Nataraj| 45|       2|
    +---------------+---+--------+
    
     df_asProfile.except(df_asPerson) Except demo
    +-----------+--------+------------------+
    |profileName|personid|profileDescription|
    +-----------+--------+------------------+
    |      Spark|       5|         SparkGuru|
    |      Spark|       9|         DevHunter|
    |      Spark|       2|    SparkSQLMaster|
    |      Spark|       3|        Evangelist|
    |      Spark|       0|         Committer|
    |      Spark|       1|       All Rounder|
    +-----------+--------+------------------+
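
    In the EXCEPT demo above, both calls return every input row. That is consistent with `except` being a set difference over whole rows rather than a comparison on `personid`, so no Person row ever equals a Profile row. If the intent is "people without a profile", comparing only the key column, or using a `left_anti` join, is probably closer to what is wanted. The following is a minimal sketch under that assumption; the object name `ExceptVsAntiJoinSketch`, the value names, and the two-row sample data are made up for illustration.

    ```scala
    import org.apache.spark.sql.{DataFrame, SparkSession}

    // Hypothetical sketch: object name, value names and sample rows are illustrative only.
    object ExceptVsAntiJoinSketch {
      case class Person(name: String, age: Int, personid: Int)
      case class Profile(profileName: String, personid: Int, profileDescription: String)

      lazy val spark: SparkSession =
        SparkSession.builder().master("local[*]").appName("except-vs-anti").getOrCreate()

      lazy val people: DataFrame = spark.createDataFrame(
        Seq(Person("Nataraj", 45, 2), Person("Siddhika", 22, 4)))
      lazy val profiles: DataFrame = spark.createDataFrame(
        Seq(Profile("Spark", 2, "SparkSQLMaster")))

      // Set difference on the key column only: ids present in people but not in profiles.
      lazy val idsWithoutProfile: DataFrame =
        people.select("personid").except(profiles.select("personid"))

      // Anti join on the key: keeps the full Person rows that have no matching profile.
      lazy val peopleWithoutProfile: DataFrame =
        people.join(profiles, Seq("personid"), "left_anti")

      def main(args: Array[String]): Unit = {
        idsWithoutProfile.show()
        peopleWithoutProfile.show()
        spark.stop()
      }
    }
    ```

    The `except` variant yields only the unmatched keys, while the `left_anti` variant keeps the remaining columns of the left side, which matches the LEFT_ANTI JOIN output shown earlier.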
    

    As discussed above, these are the Venn diagrams of all the joins.
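
    One difference between the two join styles used in the code above is worth spelling out. Joining on a column expression keeps the `personid` column from both sides, while joining with `usingColumns` keeps a single `personid` column, as the loop output shows. The following is a minimal sketch of that difference; the object name `JoinKeyColumnsSketch`, the value names, and the one-row sample data are assumptions made for illustration.

    ```scala
    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.sql.functions.col

    // Hypothetical sketch: object name, value names and sample rows are illustrative only.
    object JoinKeyColumnsSketch {
      case class Person(name: String, age: Int, personid: Int)
      case class Profile(profileName: String, personid: Int, profileDescription: String)

      lazy val spark: SparkSession =
        SparkSession.builder().master("local[*]").appName("join-key-columns").getOrCreate()

      lazy val people: DataFrame =
        spark.createDataFrame(Seq(Person("Nataraj", 45, 2))).as("dfperson")
      lazy val profiles: DataFrame =
        spark.createDataFrame(Seq(Profile("Spark", 2, "SparkSQLMaster"))).as("dfprofile")

      // Expression join: the result carries personid twice, once per side,
      // so later references need the alias (dfperson.personid / dfprofile.personid).
      lazy val byExpression: DataFrame =
        people.join(profiles, col("dfperson.personid") === col("dfprofile.personid"), "inner")

      // usingColumns join: the key appears once, as the first column of the result.
      lazy val byUsingColumns: DataFrame =
        people.join(profiles, Seq("personid"), "inner")

      def main(args: Array[String]): Unit = {
        println(byExpression.columns.mkString(", "))
        println(byUsingColumns.columns.mkString(", "))
        spark.stop()
      }
    }
    ```

    With the expression form, an unqualified `select("personid")` on the result is ambiguous, which is one reason the first example above selects columns through the `dfperson` and `dfprofile` aliases.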
