Generating join condition dynamically in Spark/Scala

Submitted by *爱你&永不变心* on 2021-02-08 07:56:37

Question


I want to be able to pass the join condition for two data frames as an input string. The idea is to make the join generic enough so that the user could pass on the condition they like.

Here's how I am doing it right now. Although it works, I don't think it's clean.

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col

val testInput = Array("a=b", "c=d")
val condition: Column = testInput.map(x => testMethod(x)).reduce((a, b) => a.and(b))
firstDataFrame.join(secondDataFrame, condition, "fullouter")

Here's the testMethod:

def testMethod(inputString: String): Column = {
  val splitted = inputString.split("=")
  col(splitted(0)) === col(splitted(1))
}

I need help figuring out a better way of taking input to generate the join condition dynamically.
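
For reference, Spark can already parse a full condition string through the built-in expr function, so a hand-rolled parser may not be needed at all. Here's a minimal sketch, assuming the column names a, b, c and d are unique across the two frames:

import org.apache.spark.sql.functions.expr

// expr turns a SQL boolean expression into a Column; Spark resolves the
// column names against the joined relation, so they must be unambiguous.
val userCondition = "a = b AND c = d"
firstDataFrame.join(secondDataFrame, expr(userCondition), "fullouter")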


Answer 1:


I'm not sure a custom method like this provides much benefit, but if you must go down that path, I would recommend making it also cover joins on:

  1. columns of the same name (which is rather common)
  2. inequality conditions

Sample code below:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Parses conditions of the form "leftColumn operator rightColumn"
// (whitespace-separated), combines them with AND, and performs the join.
def joinDFs(dfL: DataFrame, dfR: DataFrame, conditions: List[String], joinType: String): DataFrame = {
  val joinConditions = conditions.map( cond => {
      val arr = cond.split("\\s+")
      if (arr.size != 3) throw new Exception(s"Invalid join condition: $cond")
      arr(1) match {
        case "<"  => dfL(arr(0)) <   dfR(arr(2))
        case "<=" => dfL(arr(0)) <=  dfR(arr(2))
        case "="  => dfL(arr(0)) === dfR(arr(2))
        case ">=" => dfL(arr(0)) >=  dfR(arr(2))
        case ">"  => dfL(arr(0)) >   dfR(arr(2))
        case "!=" => dfL(arr(0)) =!= dfR(arr(2))
        case _    => throw new Exception(s"Unsupported operator in join condition: $cond")
      }
    } ).
    reduce(_ and _)

  dfL.join(dfR, joinConditions, joinType)
}

// Note: toDF on a local Seq requires spark.implicits._ to be in scope
// (imported automatically in spark-shell).
val dfLeft = Seq(
  (1, "2018-04-01", "p"),
  (1, "2018-04-01", "q"),
  (2, "2018-05-01", "r")
).toDF("id", "date", "value")

val dfRight = Seq(
  (1, "2018-04-15", "x"),
  (2, "2018-04-15", "y")
).toDF("id", "date", "value")

val conditions = List("id = id", "date <= date")

joinDFs(dfLeft, dfRight, conditions, "left_outer").
  show
// +---+----------+-----+----+----------+-----+
// | id|      date|value|  id|      date|value|
// +---+----------+-----+----+----------+-----+
// |  1|2018-04-01|    p|   1|2018-04-15|    x|
// |  1|2018-04-01|    q|   1|2018-04-15|    x|
// |  2|2018-05-01|    r|null|      null| null|
// +---+----------+-----+----+----------+-----+
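
One caveat with the output above: both inputs have id, date and value, so the joined result carries duplicate column names. A simple workaround is to rename the right-hand columns before the join; the sketch below assumes an r_ prefix convention of my own choosing:

// Prefix every right-hand column with r_ so the joined result has unique
// names; the condition strings then reference the renamed columns.
val dfRightRenamed = dfRight.columns.foldLeft(dfRight) {
  (df, c) => df.withColumnRenamed(c, s"r_$c")
}

joinDFs(dfLeft, dfRightRenamed, List("id = r_id", "date <= r_date"), "left_outer").show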


Source: https://stackoverflow.com/questions/50244045/generating-join-condition-dynamically-in-spark-scala
