Continuous state in Apache Beam pipeline

Submitted on 2019-12-01 06:05:10

Question


I'm developing a Beam pipeline for the Dataflow runner. I need the following functionality in my use case.

  1. Read input events from Kafka topic(s). Each Kafka message value yields a [userID, Event] pair.
  2. For each userID, I need to maintain a profile, and the current Event may trigger an update to that profile. If the profile is updated:
    • The updated profile is written to the output stream.
    • The next Event for that userID in the pipeline should see the updated profile.

I was thinking of using the state functionality provided in Beam, without depending on an external key-value store to maintain the user profile. Is this feasible with the current version of Beam (2.1.0) and the Dataflow runner? If I understand correctly, state is scoped to the elements in a single window firing (i.e. even for a GlobalWindow, the state will be scoped to the elements in a single firing of the window caused by a trigger). Am I missing something here?


Answer 1:


State would be perfectly appropriate for your use case.

The only correction is that state is scoped per key and window, and trigger firings do not affect it. So, if your state is small, you can store it in the global window. When a new element arrives, you can read the state, output elements as needed, and write changes back to the state.

The only thing to consider is how large the state may become if you have an unbounded number of user IDs. For instance, you may want an inactivity timer that clears a user's state after some period without activity.
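The per-key logic described above (read the profile, possibly update it, emit only when it changed, and expire stale users) can be sketched in plain Python, independent of the Beam APIs. The `update_profile` rule and the TTL-based cleanup below are hypothetical stand-ins for your actual profile-update logic and for Beam's timer-driven expiry:

```python
import time

def update_profile(profile, event):
    """Hypothetical update rule: count events per type; the profile
    is considered 'changed' the first time a new event type appears.
    Returns (new_profile, changed)."""
    counts = dict(profile.get("counts", {}))
    counts[event] = counts.get(event, 0) + 1
    changed = counts[event] == 1
    return {"counts": counts}, changed

class KeyedProfileState:
    """In-memory stand-in for Beam's per-key state cell plus an
    inactivity timer. Real Beam state is durable and managed by the
    runner; this only illustrates the control flow."""

    def __init__(self, ttl_seconds=3600):
        self._profiles = {}  # user_id -> (profile, last_seen_timestamp)
        self._ttl = ttl_seconds

    def process(self, user_id, event, now=None):
        now = time.time() if now is None else now
        profile, _ = self._profiles.get(user_id, ({}, now))
        new_profile, changed = update_profile(profile, event)
        self._profiles[user_id] = (new_profile, now)  # write state back
        # Emit the profile downstream only when it actually changed.
        return new_profile if changed else None

    def expire(self, now=None):
        """Stand-in for the inactivity timer: drop stale user state."""
        now = time.time() if now is None else now
        stale = [uid for uid, (_, seen) in self._profiles.items()
                 if now - seen > self._ttl]
        for uid in stale:
            del self._profiles[uid]
        return stale
```

In a real pipeline the same flow would live inside a stateful `DoFn`, with the profile held in a per-key state cell and the cleanup driven by a processing-time or event-time timer rather than an explicit `expire()` call.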

If you haven't read them, the blog posts Stateful Processing with Apache Beam and Timely (and Stateful) Processing with Apache Beam provide a good introduction to these concepts and APIs.



Source: https://stackoverflow.com/questions/47206265/continuous-state-in-apache-beam-pipeline
