How to join two DataFrames in Scala and Apache Spark?

匿名 (未验证) 提交于 2019-12-03 03:06:01

问题:

There are two DataFrames (Scala, Apache Spark 1.6.1)

1) Matches

         MatchID | Player1    |  Player2           --------------------------------                1 | John Wayne | John Doe                2 | Ive Fish   | San Simon 

2) Personal Data

              Player     |  BirthYear                --------------------------------               John Wayne | 1986               Ive Fish   | 1990               San Simon  | 1974               john Doe   | 1995 

How could create a new DataFrame with 'BirthYear' for the both players

         MatchID | Player1    |  Player2  | BYear_P1 |BYear_P2 | Diff          -------------------------------------------------------------                1 | John Wayne | John Doe  |   1986   | 1995    |  9                  2 | Ive Fish   | San Simon |   1990   | 1974    |  16 

?

I tried

    val df = MatchesDF.join(PersonalDF, MatchesDF("Player1") === PersonalDF("Player")) 

then join again for the second player

    val resDf = df.join(PersonalDF, df("Player2") === PersonalDF("Player")) 

but it's VERY time consuming operation.

May be another way to do it in Scala and Apache Spark?

回答1:

This should perform better:

case class Match(matchId: Int, player1: String, player2: String) case class Player(name: String, birthYear: Int)  val matches = Seq(   Match(1, "John Wayne", "John Doe"),   Match(2, "Ive Fish", "San Simon") )  val players = Seq(   Player("John Wayne", 1986),   Player("Ive Fish", 1990),   Player("San Simon", 1974),   Player("John Doe", 1995) )  val matchesDf = sqlContext.createDataFrame(matches) val playersDf = sqlContext.createDataFrame(players)  matchesDf.registerTempTable("matches") playersDf.registerTempTable("players")  sqlContext.sql(   "select matchId, player1, player2, p1.birthYear, p2.birthYear, abs(p1.birthYear-p2.birthYear) " +   "from matches m inner join  players p1 inner join players p2 " +   "where m.player1 = p1.name and m.player2 = p2.name").show()  +-------+----------+---------+---------+---------+---+ |matchId|   player1|  player2|birthYear|birthYear|_c5| +-------+----------+---------+---------+---------+---+ |      1|John Wayne| John Doe|     1986|     1995|  9| |      2|  Ive Fish|San Simon|     1990|     1974| 16| +-------+----------+---------+---------+---------+---+ 

I didn't find the way to express join of 3 tables in Scala DSL.



回答2:

This is a solution using spark's dataframe functions:

import sqlContext.implicits._ import org.apache.spark.sql.Row import org.apache.spark.sql.functions.abs  val matches = sqlContext.sparkContext.parallelize(Row(1, "John Wayne", "John Doe"), Row(2, "Ive Fish", "San Simon")))  val players = sqlContext.sparkContext.parallelize(Seq(   Row("John Wayne", 1986),   Row("Ive Fish", 1990),   Row("San Simon", 1974),   Row("John Doe", 1995) ))  val matchesDf = sqlContext.createDataFrame(matches, StructType(Seq(   StructField("matchId", IntegerType, nullable = false),   StructField("player1", StringType, nullable = false),   StructField("player2", StringType, nullable = false))) ).as('matches)  val playersDf = sqlContext.createDataFrame(players, StructType(Seq(   StructField("player", StringType, nullable = false),   StructField("birthYear", IntegerType, nullable = false) ))).as('players)  matchesDf   .join(playersDf, $"matches.player1" === $"players.player")   .select($"matches.matchId" as "matchId", $"matches.player1" as "player1", $"matches.player2" as "player2", $"players.birthYear" as "player1BirthYear")   .join(playersDf, $"player2" === $"players.player")   .select($"matchId" as "MatchID", $"player1" as "Player1", $"player2" as "Player2", $"player1BirthYear" as "BYear_P1", $"players.birthYear" as "BYear_P2")   .withColumn("Diff", abs('BYear_P2.minus('BYear_P1)))   .show()  +-------+----------+---------+--------+--------+----+ |MatchID|   Player1|  Player2|BYear_P1|BYear_P2|Diff| +-------+----------+---------+--------+--------+----+ |      1|John Wayne| John Doe|    1986|    1995|   9| |      2|  Ive Fish|San Simon|    1990|    1974|  16| +-------+----------+---------+--------+--------+----+ 


回答3:

val df = left.join(right, Seq("name")) display(df) 


标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!