Question
I am using spark-sql 2.4.1 with Java 8.
I have a scenario where dataset1 holds some metadata loaded from an HDFS Parquet file,
and another dataset2 which is read from a Kafka stream.
For each record in dataset2, I need to check whether its columnX value is present in dataset1.
If it is there in dataset1, I need to replace the columnX value with the matching record's column1 value from dataset1.
Otherwise, I need to generate a new value by incrementing max(column1) from dataset1 by one, and store that new mapping in dataset1.
Some sample data can be seen here:
https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1165111237342523/3447405230020171/7035720262824085/latest.html
How can this be done in Spark?
Example:
import spark.implicits._ // assumes an active SparkSession named spark

val df1 = Seq(
  ("20359045", "2263"),
  ("8476349", "3280"),
  ("60886923", "2860"),
  ("204831453", "50330"),
  ("6487533", "48236"),
  ("583633", "46067")
).toDF("company_id_external", "company_id")
val df2 = Seq(
  ("60886923", "Chengdu Fuma Food Co,.Ltd"), // company_id_external match found in df1
  ("608815923", "Australia Deloraine Dairy Pty Ltd"),
  ("59322769", "Consalac B.V."),
  ("583633", "Boso oil and fat Co., Ltd.") // company_id_external match found in df1
).toDF("company_id_external", "companyName")
If a match is found in df1:
Only two records of df2 have a "company_id_external" match in df1,
i.e. 60886923 & 583633 (the first and last record).
For these records of df2:
("60886923","Chengdu Fuma Food Co,.Ltd") becomes ==> ("2860","Chengdu Fuma Food Co,.Ltd")
("583633","Boso oil and fat Co., Ltd.") becomes ==> ("46067","Boso oil and fat Co., Ltd.")
Else, if no match is found in df1:
For the other two records of df2 there is no "company_id_external" match in df1, so a company_id needs to be generated for them and added to df1, i.e. for ("608815923","Australia Deloraine Dairy Pty Ltd") and ("59322769","Consalac B.V.").
company_id generation logic: new company_id = max(company_id) of df1 + 1. From the data above the max is 50330, so 50330 + 1 => 50331; add this record to df1, i.e. ("608815923","50331"). Do the same for the other one, i.e. add ("59322769","50332") to df1.
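For this branch I am thinking of something along these lines (again a sketch under the same assumptions; unmatched, maxId and newMappings are illustrative names, and the unpartitioned window pulls all rows to one partition, which should be fine for small micro-batches):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, max, row_number}

// Highest existing company_id, compared numerically rather than as a string.
val maxId = df1.agg(max(col("company_id").cast("long"))).head.getLong(0)

// df2 rows with no company_id_external match in df1.
val unmatched = df2.join(df1, Seq("company_id_external"), "left_anti")

// Hand out consecutive new ids starting at maxId + 1 (the ordering of the
// unmatched rows, and hence which row gets which id, is arbitrary here).
val newMappings = unmatched
  .withColumn("rn", row_number().over(Window.orderBy(col("company_id_external"))))
  .select(col("company_id_external"),
    (col("rn") + maxId).cast("string").as("company_id"))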
**So now**
df1 = Seq(
  ("20359045", "2263"),
  ("8476349", "3280"),
  ("60886923", "2860"),
  ("204831453", "50330"),
  ("6487533", "48236"),
  ("583633", "46067"),
  ("608815923", "50331"),
  ("59322769", "50332")
).toDF("company_id_external", "company_id")
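For completeness, I imagine the pieces above being folded back together roughly like this (a sketch; resolved and updatedDf1 are illustrative names, and since df2 actually comes from a Kafka stream I assume this per-micro-batch logic would have to run inside a foreachBatch sink in Structured Streaming):

import org.apache.spark.sql.functions.col

// The rewritten df2: matched rows plus the freshly keyed ones.
val resolved = matched.union(
  unmatched.join(newMappings, Seq("company_id_external"))
    .select(col("company_id"), col("companyName")))

// df1 grows by the newly generated mappings; this is what would be
// written back to the HDFS Parquet location.
val updatedDf1 = df1.union(newMappings)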
Source: https://stackoverflow.com/questions/57479581/updating-static-source-based-on-kafka-stream-using-spark-streaming