change data capture in spark

泪湿孤枕 submitted on 2021-01-05 11:52:43

Question


I have got a requirement, but I am confused about how to do it. I have two dataframes. The first time, I receive the data file below (file1):

prodid, lastupdatedate, indicator

00001,,A
00002,01-25-1981,A
00003,01-26-1982,A
00004,12-20-1985,A

The output should be:

00001,1900-01-01,2400-01-01,A
00002,1981-01-25,2400-01-01,A
00003,1982-01-26,2400-01-01,A
00004,1985-12-20,2400-01-01,A

The second time, I receive another file (file2):

prodid, lastupdatedate, indicator

00002,01-25-2018,U
00004,01-25-2018,U
00006,01-25-2018,A
00008,01-25-2018,A

I want the end result to look like:

00001,1900-01-01,2400-01-01,A
00002,1981-01-25,2018-01-25,I
00002,2018-01-25,2400-01-01,A
00003,1982-01-26,2400-01-01,A
00004,1985-12-20,2018-01-25,I
00004,2018-01-25,2400-01-01,A
00006,2018-01-25,2400-01-01,A
00008,2018-01-25,2400-01-01,A

So for whatever updates are present in the second file, that update date should go into the second column, the default date (2400-01-01) into the third column, along with the relevant indicator in the fourth. The default indicator is A.
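Before reaching for Spark, the merge rule itself can be sketched in plain Scala on in-memory data. This is only an illustrative model, not the asker's or answerer's code; the `Record` case class and `merge` helper are made-up names:

```scala
// Hypothetical plain-Scala model of the merge rule described above.
// A row is (prodid, start date, end date, indicator).
case class Record(prodid: String, start: String, end: String, indicator: String)

val DefaultEnd = "2400-01-01"

// `current` is the existing history keyed by prodid;
// `changes` maps prodid to the incoming (lastupdatedate, indicator) pair.
def merge(current: Map[String, Record],
          changes: Map[String, (String, String)]): List[Record] = {
  val updated = current.map { case (id, r) =>
    changes.get(id) match {
      case Some((date, "U")) =>
        List(r.copy(end = date, indicator = "I"),      // close the old row
             Record(id, date, DefaultEnd, "A"))        // open a fresh "active" row
      case _ =>
        List(r)                                        // untouched product
    }
  }.toList.flatten
  val inserts = changes.collect {
    case (id, (date, "A")) if !current.contains(id) =>
      Record(id, date, DefaultEnd, "A")                // brand-new product
  }
  (updated ++ inserts).sortBy(r => (r.prodid, r.start))
}
```

Each "U" change produces two rows (the closed history row marked I and a new active row), while each "A" change for an unseen prodid produces one new row, matching the expected end result above.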

I have started like this:

val spark = SparkSession.builder()
  .master("local")
  .appName("creating data frame for csv")
  .getOrCreate()

import spark.implicits._

val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("d:/prod.txt")

val df1 = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("d:/prod1.txt")

val newdf = df.na.fill("01-01-1900", Seq("lastupdatedate"))

// note: this condition compares the Column objects themselves (always false);
// column-level conditions belong in filter/when with ===, not a plain Scala if
if ((df1("indicator") == 'U') && (df1("prodid") == newdf("prodid"))) {
  val df3 = df1.except(newdf)
}

Answer 1:


You should join the two dataframes on prodid and use the when function to shape the columns into the expected output. Then filter the updated rows, build the re-opened second rows from them, and merge them back in (I have included comments explaining each part of the code):

import org.apache.spark.sql.functions._
//filling empty lastupdatedate and changing the date to the expected format
val newdf = df.na.fill("01-01-1900",Seq("lastupdatedate"))
  .withColumn("lastupdatedate", date_format(unix_timestamp(trim(col("lastupdatedate")), "MM-dd-yyyy").cast("timestamp"), "yyyy-MM-dd"))

//changing the date to the expected format of the second dataframe
val newdf1 = df1.withColumn("lastupdatedate", date_format(unix_timestamp(trim(col("lastupdatedate")), "MM-dd-yyyy").cast("timestamp"), "yyyy-MM-dd"))

//joining both dataframes and updating columns according to your needs
val tempdf = newdf.as("table1").join(newdf1.as("table2"),Seq("prodid"), "outer")
    .select(col("prodid"),
      when(col("table1.lastupdatedate").isNotNull, col("table1.lastupdatedate")).otherwise(col("table2.lastupdatedate")).as("lastupdatedate"),
      when(col("table1.indicator").isNotNull, when(col("table2.lastupdatedate").isNotNull, col("table2.lastupdatedate")).otherwise(lit("2400-01-01"))).otherwise(lit("2400-01-01")).as("defaultdate"),
      when(col("table2.indicator").isNull, col("table1.indicator")).otherwise(when(col("table2.indicator") === "U", lit("I")).otherwise(col("table2.indicator"))).as("indicator"))

//filtering tempdf for duplication
val filtereddf = tempdf.filter(col("indicator") === "I")
                        .withColumn("lastupdatedate", col("defaultdate"))
                        .withColumn("defaultdate", lit("2400-01-01"))
                        .withColumn("indicator", lit("A"))

//finally merging both dataframes
tempdf.union(filtereddf).sort("prodid", "lastupdatedate").show(false)

which should give you

+------+--------------+-----------+---------+
|prodid|lastupdatedate|defaultdate|indicator|
+------+--------------+-----------+---------+
|1     |1900-01-01    |2400-01-01 |A        |
|2     |1981-01-25    |2018-01-25 |I        |
|2     |2018-01-25    |2400-01-01 |A        |
|3     |1982-01-26    |2400-01-01 |A        |
|4     |1985-12-20    |2018-01-25 |I        |
|4     |2018-01-25    |2400-01-01 |A        |
|6     |2018-01-25    |2400-01-01 |A        |
|8     |2018-01-25    |2400-01-01 |A        |
+------+--------------+-----------+---------+
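The date normalization step in the code above relies on Spark's unix_timestamp/date_format pair. The same MM-dd-yyyy to yyyy-MM-dd conversion can be sanity-checked in plain Scala with java.time; this is just a sketch, and the `normalize` helper name is made up:

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// Illustrative stand-in for the Spark expression
// date_format(unix_timestamp(trim(col), "MM-dd-yyyy").cast("timestamp"), "yyyy-MM-dd")
val inputFmt  = DateTimeFormatter.ofPattern("MM-dd-yyyy")
val outputFmt = DateTimeFormatter.ofPattern("yyyy-MM-dd")

def normalize(raw: String): String =
  Option(raw).map(_.trim).filter(_.nonEmpty) match {
    case Some(s) => LocalDate.parse(s, inputFmt).format(outputFmt)
    case None    => "1900-01-01"   // same default the answer fills for missing dates
  }
```

Trimming before parsing matters here because the CSV headers and values in the question contain spaces after the commas, which is also why the answer wraps the column in trim.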


Source: https://stackoverflow.com/questions/49658853/change-data-capture-in-spark
