Question
I have a requirement to implement, but I am confused about how to do it. I have two dataframes. The first time, I get the data below (file1):
prodid,lastupdatedate,indicator
00001,,A
00002,01-25-1981,A
00003,01-26-1982,A
00004,12-20-1985,A
The output should be:
00001,1900-01-01,2400-01-01,A
00002,1981-01-25,2400-01-01,A
00003,1982-01-26,2400-01-01,A
00004,1985-12-20,2400-01-01,A
The second time, I get another file (file2):
prodid,lastupdatedate,indicator
00002,01-25-2018,U
00004,01-25-2018,U
00006,01-25-2018,A
00008,01-25-2018,A
I want the end result to look like this:
00001,1900-01-01,2400-01-01,A
00002,1981-01-25,2018-01-25,I
00002,2018-01-25,2400-01-01,A
00003,1982-01-26,2400-01-01,A
00004,1985-12-20,2018-01-25,I
00004,2018-01-25,2400-01-01,A
00006,2018-01-25,2400-01-01,A
00008,2018-01-25,2400-01-01,A
So whatever updates are in the second file, the update date should go into the second column of the new row and the default date (2400-01-01) into the third column, with the relevant indicator; the previous version of an updated row should be closed out with indicator I. The default indicator is A.
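To make the rule concrete, here is a minimal sketch of what a single update does to an existing row (the ProdRow case class is hypothetical, purely for illustration):

case class ProdRow(prodid: String, startDate: String, endDate: String, indicator: String)

// the existing current row for product 00002
val existing = ProdRow("00002", "1981-01-25", "2400-01-01", "A")

// the update record "00002,01-25-2018,U" turns that one row into two:
val result = Seq(
  existing.copy(endDate = "2018-01-25", indicator = "I"), // old version, closed at the update date
  ProdRow("00002", "2018-01-25", "2400-01-01", "A")       // new current version, open-ended
)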
I have started like this:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local")
  .appName("creating data frame for csv")
  .getOrCreate()
import spark.implicits._

val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("d:/prod.txt")
val df1 = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("d:/prod1.txt")

val newdf = df.na.fill("01-01-1900", Seq("lastupdatedate"))
// this is where I am stuck: comparing columns like this does not work
if ((df1("indicator") == 'U') && (df1("prodid") == newdf("prodid"))) {
  val df3 = df1.except(newdf)
}
Answer 1:
You should join them on prodid and use some when functions to manipulate the dataframes into the expected output. Then filter the updated rows to build the second rows and merge them back (I have included comments explaining each part of the code):
import org.apache.spark.sql.functions._

// fill the empty lastupdatedate and convert the dates to the expected yyyy-MM-dd format
val newdf = df.na.fill("01-01-1900", Seq("lastupdatedate"))
  .withColumn("lastupdatedate", date_format(unix_timestamp(trim(col("lastupdatedate")), "MM-dd-yyyy").cast("timestamp"), "yyyy-MM-dd"))

// convert the dates of the second dataframe to the expected format
val newdf1 = df1.withColumn("lastupdatedate", date_format(unix_timestamp(trim(col("lastupdatedate")), "MM-dd-yyyy").cast("timestamp"), "yyyy-MM-dd"))

// join both dataframes on prodid and update the columns according to your needs
val tempdf = newdf.as("table1").join(newdf1.as("table2"), Seq("prodid"), "outer")
  .select(
    col("prodid"),
    // start date: keep the original date when the product already existed, otherwise take the update date
    when(col("table1.lastupdatedate").isNotNull, col("table1.lastupdatedate"))
      .otherwise(col("table2.lastupdatedate")).as("lastupdatedate"),
    // end date: an existing product that received an update is closed at the update date, everything else stays open at the default date
    when(col("table1.indicator").isNotNull,
      when(col("table2.lastupdatedate").isNotNull, col("table2.lastupdatedate")).otherwise(lit("2400-01-01")))
      .otherwise(lit("2400-01-01")).as("defaultdate"),
    // indicator: a U in the update file marks the old row as inactive (I), otherwise keep what is there
    when(col("table2.indicator").isNull, col("table1.indicator"))
      .otherwise(when(col("table2.indicator") === "U", lit("I")).otherwise(col("table2.indicator"))).as("indicator"))

// build the second (current) row for every updated product
val filtereddf = tempdf.filter(col("indicator") === "I")
  .withColumn("lastupdatedate", col("defaultdate"))
  .withColumn("defaultdate", lit("2400-01-01"))
  .withColumn("indicator", lit("A"))

// finally merge both dataframes
tempdf.union(filtereddf).sort("prodid", "lastupdatedate").show(false)
which should give you:
+------+--------------+-----------+---------+
|prodid|lastupdatedate|defaultdate|indicator|
+------+--------------+-----------+---------+
|1 |1900-01-01 |2400-01-01 |A |
|2 |1981-01-25 |2018-01-25 |I |
|2 |2018-01-25 |2400-01-01 |A |
|3 |1982-01-26 |2400-01-01 |A |
|4 |1985-12-20 |2018-01-25 |I |
|4 |2018-01-25 |2400-01-01 |A |
|6 |2018-01-25 |2400-01-01 |A |
|8 |2018-01-25 |2400-01-01 |A |
+------+--------------+-----------+---------+
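Note that prodid appears without its leading zeros because inferSchema reads it as an integer; read the column as a string (or cast and pad it afterwards) if the zero-padding matters. If you also want to persist the merged result rather than just show it, here is a minimal sketch (the output path is hypothetical):

tempdf.union(filtereddf)
  .sort("prodid", "lastupdatedate")
  .write
  .option("header", "true")
  .mode("overwrite")
  .csv("d:/prod_merged") // hypothetical output location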
Source: https://stackoverflow.com/questions/49658853/change-data-capture-in-spark