Implement SCD Type 2 in Spark

Submitted by 自闭症网瘾萝莉.ら on 2021-02-18 08:47:47

Question


I'm trying to implement SCD (Slowly Changing Dimension) Type 2 logic in Spark 2.4.4. I have two DataFrames: one containing the existing data and the other containing the new incoming data.

Input and expected output are given below. What needs to happen is:

  1. All incoming rows should get appended to the existing data.

  2. Only the following 3 rows, which were previously 'active', should become inactive, with the appropriate 'endDate' populated as follows:

    pk=1, amount = 20 => Row should become 'inactive' & 'endDate' is the 'startDate' of following row (Lead)

    pk=2, amount = 100 => Row should become 'inactive' & 'endDate' is the 'startDate' of following row (Lead)

    pk=3, amount = 750 => Row should become 'inactive' & 'endDate' is the 'startDate' of following row (Lead)

How do I do this in Spark?

Existing Data:

+---+------+-------------------+-------------------+------+
| pk|amount|          startDate|            endDate|active|
+---+------+-------------------+-------------------+------+
|  1|    10|2019-01-01 12:00:00|2019-01-20 05:00:00|     0|
|  1|    20|2019-01-20 05:00:00|               null|     1|
|  2|   100|2019-01-01 00:00:00|               null|     1|
|  3|    75|2019-01-01 06:00:00|2019-01-26 08:00:00|     0|
|  3|   750|2019-01-26 08:00:00|               null|     1|
| 10|    40|2019-01-01 00:00:00|               null|     1|
+---+------+-------------------+-------------------+------+

New Incoming Data:

+---+------+-------------------+-------------------+------+
| pk|amount|          startDate|            endDate|active|
+---+------+-------------------+-------------------+------+
|  1|    50|2019-02-01 07:00:00|2019-02-02 08:00:00|     0|
|  1|    75|2019-02-02 08:00:00|               null|     1|
|  2|   200|2019-02-01 05:00:00|2019-02-01 13:00:00|     0|
|  2|    60|2019-02-01 13:00:00|2019-02-01 19:00:00|     0|
|  2|   500|2019-02-01 19:00:00|               null|     1|
|  3|   175|2019-02-01 00:00:00|               null|     1|
|  4|    50|2019-02-02 12:00:00|2019-02-02 14:00:00|     0|
|  4|   300|2019-02-02 14:00:00|               null|     1|
|  5|   500|2019-02-02 00:00:00|               null|     1|
+---+------+-------------------+-------------------+------+

Expected Output:

+---+------+-------------------+-------------------+------+
| pk|amount|          startDate|            endDate|active|
+---+------+-------------------+-------------------+------+
|  1|    10|2019-01-01 12:00:00|2019-01-20 05:00:00|     0|
|  1|    20|2019-01-20 05:00:00|2019-02-01 07:00:00|     0|
|  1|    50|2019-02-01 07:00:00|2019-02-02 08:00:00|     0|
|  1|    75|2019-02-02 08:00:00|               null|     1|
|  2|   100|2019-01-01 00:00:00|2019-02-01 05:00:00|     0|
|  2|   200|2019-02-01 05:00:00|2019-02-01 13:00:00|     0|
|  2|    60|2019-02-01 13:00:00|2019-02-01 19:00:00|     0|
|  2|   500|2019-02-01 19:00:00|               null|     1|
|  3|    75|2019-01-01 06:00:00|2019-01-26 08:00:00|     0|
|  3|   750|2019-01-26 08:00:00|2019-02-01 00:00:00|     1|
|  3|   175|2019-02-01 00:00:00|               null|     1|
|  4|    50|2019-02-02 12:00:00|2019-02-02 14:00:00|     0|
|  4|   300|2019-02-02 14:00:00|               null|     1|
|  5|   500|2019-02-02 00:00:00|               null|     1|
| 10|    40|2019-01-01 00:00:00|               null|     1|
+---+------+-------------------+-------------------+------+
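
For anyone who wants to reproduce this locally, the two inputs can be built roughly as follows. This is a minimal sketch, assuming a local SparkSession and the names dfOld/dfNew (the first answer below refers to the same frames as df_old and df_new):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    val spark = SparkSession.builder().master("local[*]").appName("scd2").getOrCreate()
    import spark.implicits._

    // existing data (timestamps are read as strings and cast afterwards)
    val dfOld = Seq(
      (1,  10,  "2019-01-01 12:00:00", Some("2019-01-20 05:00:00"), 0),
      (1,  20,  "2019-01-20 05:00:00", None,                        1),
      (2,  100, "2019-01-01 00:00:00", None,                        1),
      (3,  75,  "2019-01-01 06:00:00", Some("2019-01-26 08:00:00"), 0),
      (3,  750, "2019-01-26 08:00:00", None,                        1),
      (10, 40,  "2019-01-01 00:00:00", None,                        1)
    ).toDF("pk", "amount", "startDate", "endDate", "active")
      .withColumn("startDate", col("startDate").cast("timestamp"))
      .withColumn("endDate", col("endDate").cast("timestamp"))

    // new incoming data, built the same way from the second table
    val dfNew = Seq(
      (1, 50,  "2019-02-01 07:00:00", Some("2019-02-02 08:00:00"), 0),
      (1, 75,  "2019-02-02 08:00:00", None,                        1),
      (2, 200, "2019-02-01 05:00:00", Some("2019-02-01 13:00:00"), 0),
      (2, 60,  "2019-02-01 13:00:00", Some("2019-02-01 19:00:00"), 0),
      (2, 500, "2019-02-01 19:00:00", None,                        1),
      (3, 175, "2019-02-01 00:00:00", None,                        1),
      (4, 50,  "2019-02-02 12:00:00", Some("2019-02-02 14:00:00"), 0),
      (4, 300, "2019-02-02 14:00:00", None,                        1),
      (5, 500, "2019-02-02 00:00:00", None,                        1)
    ).toDF("pk", "amount", "startDate", "endDate", "active")
      .withColumn("startDate", col("startDate").cast("timestamp"))
      .withColumn("endDate", col("endDate").cast("timestamp"))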

Answer 1:


You can start by selecting the first startDate for each pk group from the new DataFrame and joining it with the old one to update the desired columns. Then union the join result with the new DataFrame.

Something like this:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lit, row_number, when}
import spark.implicits._

// get the earliest new startDate for each pk group
val w = Window.partitionBy($"pk").orderBy($"startDate")
val updates = df_new.withColumn("rn", row_number().over(w)).filter("rn = 1").select($"pk", $"startDate")

// join with the old data; flag the currently active old rows that have incoming data,
// so that both endDate and active are derived from the original column values
val joinOldNew = df_old.join(updates.alias("new"), Seq("pk"), "left")
                       .withColumn("toClose", $"endDate".isNull && $"active" === lit(1) && $"new.startDate".isNotNull)
                       .withColumn("endDate", when($"toClose", $"new.startDate").otherwise($"endDate"))
                       .withColumn("active", when($"toClose", lit(0)).otherwise($"active"))
                       .drop($"new.startDate")
                       .drop("toClose")

// append all new rows to the updated old rows
val result = joinOldNew.union(df_new)



Answer 2:


  1. Union the two data frames.
  2. groupByKey on pk.
  3. mapGroups will provide a tuple of the key and an iterator of rows.
  4. For each group, sort the rows, iterate over them, close the records that need closing, and keep the rows you want (see the fuller sketch after the snippet below).
    // sketch only: union the two frames, group by pk, then fix up each group
    val df = dfOld.union(dfNew)
    df.groupByKey(row => row.getAs[Int]("pk"))
      .mapGroups { (key, rows) =>
        // apply all the logic you need per pk:
        // sort the rows by startDate, keep the latest row open, close the older ones
      }
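
A slightly more concrete sketch of this grouping approach, assuming the dfOld/dfNew frames from the question setup and a Record case class that mirrors the schema (the class name, the use of flatMapGroups, and keeping an already-set endDate are assumptions, not part of the original answer):

    import java.sql.Timestamp
    import spark.implicits._

    // hypothetical case class mirroring the table schema
    case class Record(pk: Int, amount: Int, startDate: Timestamp,
                      endDate: Option[Timestamp], active: Int)

    val merged = dfOld.as[Record].union(dfNew.as[Record])
      .groupByKey(_.pk)
      .flatMapGroups { (_, rows) =>
        // sort each pk's history chronologically
        val sorted = rows.toSeq.sortBy(_.startDate.getTime)
        // close every row except the last one: borrow the next row's startDate
        // as endDate (unless an endDate is already set) and mark it inactive
        val closed = sorted.zip(sorted.drop(1)).map { case (cur, next) =>
          cur.copy(endDate = cur.endDate.orElse(Some(next.startDate)), active = 0)
        }
        closed :+ sorted.last
      }
      .toDF()

flatMapGroups is used instead of mapGroups here because each group produces several output rows rather than a single aggregated value.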




Answer 3:


Thanks to the answer suggested by @blackbishop, I was able to get it working. Here's the working version (in case someone is interested):

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, lit, row_number, when}

    // get the first state by date for each pk group
    val w = Window.partitionBy("pk").orderBy("startDate")
    val updates = dfNew.withColumn("rn", row_number().over(w)).filter("rn = 1").select("pk", "startDate")

    // join with the old data and update the old values when there is a match
    val joinOldNew = dfOld.join(updates.alias("new"), Seq("pk"), "left")
        .withColumn("endDate", when(col("endDate").isNull
            && col("active") === lit(1) && col("new.startDate").isNotNull,
            col("new.startDate")).otherwise(col("endDate")))
        .withColumn("active", when(col("endDate").isNull, lit(1))
            .otherwise(lit(0)))
        .drop(col("new.startDate"))

    // union all (the orderBy is not necessary; it is added only to facilitate testing)
    val results = joinOldNew.union(dfNew).orderBy(col("pk"), col("startDate"))
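
As a quick sanity check (an illustrative check only, not part of the original answer), the merged history can be displayed and counted against the tables in the question:

    // inspect the merged history
    results.show(20, false)

    // all 6 existing rows plus 9 incoming rows should be present
    assert(results.count() == 15)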


Source: https://stackoverflow.com/questions/59586700/implement-scd-type-2-in-spark
