How to refresh a metadata DataFrame in a streaming app every 5 minutes?

Submitted by 蓝咒 on 2020-01-11 13:16:06

Question


I am using Spark SQL 2.4.x and the DataStax spark-cassandra-connector for Cassandra 3.x, along with Kafka.

I have a scenario where finance data arrives from a Kafka topic, say financeDf. I need to remap some of its fields using a lookup DataFrame, metaDataDf = // loaded from a Cassandra table. But this Cassandra table (behind metaDataDf) can be updated once an hour.

In a Spark SQL Structured Streaming application, how should I get the latest data from the Cassandra table every hour?

I don't want to reload metaDataDf for every record I receive from the financeDf topic.
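
For reference, here is a minimal sketch of my current setup (the topic, keyspace, table, column names, and schema below are illustrative, not my real ones):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, from_json}
    import org.apache.spark.sql.types.{StringType, StructType}

    val spark = SparkSession.builder()
      .appName("finance-enrichment")
      .getOrCreate()

    // Streaming finance data from the Kafka topic.
    val financeDf = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "finance-topic")
      .load()

    // Illustrative schema; the real payload has more fields.
    val schema = new StructType().add("instrument_id", StringType)

    val parsed = financeDf
      .select(from_json(col("value").cast("string"), schema).as("rec"))
      .select("rec.*")

    // Metadata loaded once from Cassandra -- this is the copy that goes stale.
    val metaDataDf = spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "finance", "table" -> "meta_data"))
      .load()

    // Stream-static join: metaDataDf is planned once and never refreshed.
    val enriched = parsed.join(metaDataDf, Seq("instrument_id"), "left")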

How should this be done/handled? Any help, please.


Answer 1:


There are several ways to do something like this; when searching, try to focus on "Spark stream enrichment with static data". There are already some answers on Stack Overflow.

The main problem for you is the data refresh. It depends on your needs and whether you can sacrifice some precision, that is, whether you need the remapping to pick up a Cassandra change immediately or not. Some possible solutions:

  1. Introduce a special control event in Kafka, created by the external system, that notifies your job whenever the Cassandra table has changed (fully accurate, refreshed immediately after a change).
  2. Introduce a constant input DStream, or the equivalent mechanism in Structured Streaming. In effect this creates a separate operation that re-reads Cassandra on each streaming interval and updates the cache if the data has changed. It is not refreshed immediately after a change, only at the nearest streaming interval; see the sketch after this list.
  3. There are also solutions based on window functionality, but again the data is only refreshed after some delay.
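
For option 2, here is one common pattern as a sketch, not the only way to do it. It assumes a micro-batch sink (foreachBatch) and reuses the spark session and parsed stream from your question; the refresh period, keyspace, table, and join key are illustrative. The trick is that a stream-static join captures the static DataFrame at query start, so instead the join is done inside foreachBatch, which re-reads a @volatile reference on every micro-batch:

    import java.util.concurrent.{Executors, TimeUnit}
    import org.apache.spark.sql.DataFrame

    // Load and cache a snapshot of the Cassandra metadata.
    def loadMetaData(): DataFrame =
      spark.read
        .format("org.apache.spark.sql.cassandra")
        .options(Map("keyspace" -> "finance", "table" -> "meta_data"))
        .load()
        .cache()

    @volatile var metaDataDf: DataFrame = loadMetaData()

    // A background thread swaps in a fresh snapshot on a fixed schedule.
    val scheduler = Executors.newSingleThreadScheduledExecutor()
    scheduler.scheduleAtFixedRate(new Runnable {
      def run(): Unit = {
        val fresh = loadMetaData()
        val stale = metaDataDf
        metaDataDf = fresh   // subsequent batches see the new snapshot
        stale.unpersist()    // release the old cached data
      }
    }, 1, 1, TimeUnit.HOURS)

    // foreachBatch reads the @volatile reference on every micro-batch, so
    // each batch joins against the latest cached snapshot, while Cassandra
    // itself is only scanned once per refresh interval.
    val query = parsed.writeStream
      .foreachBatch { (batch: DataFrame, batchId: Long) =>
        batch.join(metaDataDf, Seq("instrument_id"), "left")
          .write
          .format("console")   // illustrative sink
          .save()
      }
      .start()

The swap-and-unpersist keeps Cassandra reads down to one per refresh interval instead of one per record. The trade-off, as noted above, is that a change in Cassandra only becomes visible at the next refresh, not immediately.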

Of course, there are other possibilities; it all depends on your preferences.



Source: https://stackoverflow.com/questions/59653240/how-to-refresh-meta-data-dataframe-in-streaming-app-in-every-5-min
