How to check whether a streaming data record already exists in historical/persisted data in Spark streaming?

Submitted by 强颜欢笑 on 2019-11-26 14:55:27

Question


For my PoC, I am using spark-sql 2.4.x with Kafka. I have streaming company data coming from a Kafka topic, with fields such as "company_id", "created_date", "field1", "field2", etc. Let's call this newCompanyDataStream.

I have old company data in a parquet file, i.e. "hdfs://parquet/company". Let's call this oldCompanyDataDf.

For each record received from the Kafka stream (newCompanyDataStream), I need to check whether data for the given company_id is already present in the "hdfs://parquet/company" file (oldCompanyDataDf).

How to check this?
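Here is roughly how I am reading both sides and doing a stream-static left outer join to flag records whose company_id already exists in the parquet data. This is just a sketch: the Kafka broker address, the topic name, and the field types in the schema are placeholders/assumptions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val spark = SparkSession.builder.appName("company-stream-check").getOrCreate()
import spark.implicits._

// Assumed schema of the company payload (types are placeholders)
val companySchema = new StructType()
  .add("company_id", StringType)
  .add("created_date", StringType)
  .add("field1", StringType)
  .add("field2", StringType)

// Historical data: a static DataFrame
val oldCompanyDataDf = spark.read.parquet("hdfs://parquet/company")

// New data: streaming DataFrame from Kafka (broker and topic names are placeholders)
val newCompanyDataStream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "company")
  .load()
  .select(from_json($"value".cast("string"), companySchema).as("data"))
  .select("data.*")

// Stream-static left outer join: rows where old.company_id is null are not yet persisted
val checkedDf = newCompanyDataStream.alias("new")
  .join(oldCompanyDataDf.alias("old"),
    col("new.company_id") === col("old.company_id"),
    "left_outer")
  .withColumn("already_exists", col("old.company_id").isNotNull)
```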

If the newCompanyDataStream "field1" and the oldCompanyDataDf "field1" are not the same, perform task2 (i.e. remove the oldCompanyDataDf record and add the newCompanyDataStream record with the new "field1" into oldCompanyDataDf).

If the newCompanyDataStream "field2" and the oldCompanyDataDf "field2" are not the same, perform task2 (i.e. remove the oldCompanyDataDf record and add the newCompanyDataStream record with the new "field2" into oldCompanyDataDf).

How to implement this using spark-sql structured streaming?
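One approach I am considering (continuing from the snippet above) is to do the compare-and-replace per micro-batch with foreachBatch. This is an untested sketch: the output path "hdfs://parquet/company_merged" is a placeholder, since overwriting the same parquet directory that is read inside the batch is not safe; a table format with real upsert support (e.g. Delta Lake) would probably be cleaner.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

def upsertCompanyBatch(batchDf: DataFrame, batchId: Long): Unit = {
  // Re-read the historical data for this micro-batch
  val oldDf = batchDf.sparkSession.read.parquet("hdfs://parquet/company")

  // Keep only records that are new, or whose field1/field2 differ from the persisted row
  val changed = batchDf.alias("new")
    .join(oldDf.alias("old"),
      col("new.company_id") === col("old.company_id"),
      "left_outer")
    .filter(col("old.company_id").isNull ||
      col("new.field1") =!= col("old.field1") ||
      col("new.field2") =!= col("old.field2"))
    .select("new.*")

  // Drop the old versions of those companies, then append the new versions
  val merged = oldDf
    .join(changed.select("company_id"), Seq("company_id"), "left_anti")
    .unionByName(changed)

  // Placeholder output path; not the same directory that was read above in this batch
  merged.write.mode("overwrite").parquet("hdfs://parquet/company_merged")
}

newCompanyDataStream.writeStream
  .foreachBatch(upsertCompanyBatch _)
  .start()
  .awaitTermination()
```

Is this a reasonable pattern for structured streaming, or is there a better way?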

Any snippet or advice would be very much appreciated.

Source: https://stackoverflow.com/questions/57610448/how-to-check-streaming-data-record-already-there-in-historical-persisted-data-in
