Question
I am using spark-sql 2.4.1 with Java 8.
I have a scenario where dataset1 holds some metadata loaded from an HDFS Parquet file,
and another dataset2 which is read from a Kafka stream.
For each record in dataset2, I need to check whether its columnX value is present in dataset1.
If it is there in dataset1, I need to replace the columnX value with the matching record's column1 value from dataset1.
Otherwise, I need to generate a new value by incrementing max(column1) from dataset1 by one, and store that new mapping in dataset1.
Some sample data can be seen here:
https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1165111237342523/3447405230020171/7035720262824085/latest.html
How can this be done in Spark?
Example:
import spark.implicits._ // assumes an active SparkSession named spark

val df1 = Seq(
  ("20359045", "2263"),
  ("8476349", "3280"),
  ("60886923", "2860"),
  ("204831453", "50330"),
  ("6487533", "48236"),
  ("583633", "46067")
).toDF("company_id_external", "company_id")
val df2 = Seq(
  ("60886923", "Chengdu Fuma Food Co,.Ltd"), // company_id_external match found in df1
  ("608815923", "Australia Deloraine Dairy Pty Ltd"),
  ("59322769", "Consalac B.V."),
  ("583633", "Boso oil and fat Co., Ltd.") // company_id_external match found in df1
).toDF("company_id_external", "companyName")
If a match is found in df1:
Only two records of df2 have a "company_id_external" match in df1,
i.e. 60886923 & 583633 (the first and last record).
For these records of df2:
("60886923","Chengdu Fuma Food Co,.Ltd") becomes ==> ("2860","Chengdu Fuma Food Co,.Ltd")
("583633","Boso oil and fat Co., Ltd.") becomes ==> ("46067","Boso oil and fat Co., Ltd.")
Else, if no match is found in df1:
For the other two records of df2 there is no "company_id_external" match in df1, so a company_id needs to be generated for them and added to df1, i.e. for ("608815923","Australia Deloraine Dairy Pty Ltd") and ("59322769","Consalac B.V.").
company_id generation logic: new company_id = max(company_id) of df1 + 1. From the data above the max is 50330, so 50330 + 1 => 50331; add this record to df1, i.e. ("608815923","50331"). Do the same for the other one, i.e. add ("59322769","50332") to df1.
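For this branch I am thinking of something along these lines (again a sketch under the same assumptions; unmatched, maxId and newMappings are illustrative names, and the unpartitioned window pulls all rows to one partition, which should be fine for small micro-batches):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, max, row_number}

// Highest existing company_id, compared numerically rather than as a string.
val maxId = df1.agg(max(col("company_id").cast("long"))).head.getLong(0)

// df2 rows with no company_id_external match in df1.
val unmatched = df2.join(df1, Seq("company_id_external"), "left_anti")

// Hand out consecutive new ids starting at maxId + 1 (the ordering of the
// unmatched rows, and hence which row gets which id, is arbitrary here).
val newMappings = unmatched
  .withColumn("rn", row_number().over(Window.orderBy(col("company_id_external"))))
  .select(col("company_id_external"),
    (col("rn") + maxId).cast("string").as("company_id"))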
**So now**
df1 = Seq(
  ("20359045", "2263"),
  ("8476349", "3280"),
  ("60886923", "2860"),
  ("204831453", "50330"),
  ("6487533", "48236"),
  ("583633", "46067"),
  ("608815923", "50331"),
  ("59322769", "50332")
).toDF("company_id_external", "company_id")
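For completeness, I imagine the pieces above being folded back together roughly like this (a sketch; resolved and updatedDf1 are illustrative names, and since df2 actually comes from a Kafka stream I assume this per-micro-batch logic would have to run inside a foreachBatch sink in Structured Streaming):

import org.apache.spark.sql.functions.col

// The rewritten df2: matched rows plus the freshly keyed ones.
val resolved = matched.union(
  unmatched.join(newMappings, Seq("company_id_external"))
    .select(col("company_id"), col("companyName")))

// df1 grows by the newly generated mappings; this is what would be
// written back to the HDFS Parquet location.
val updatedDf1 = df1.union(newMappings)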
Source: https://stackoverflow.com/questions/57479581/updating-static-source-based-on-kafka-stream-using-spark-streaming