Question
I'm developing a Beam pipeline for the Dataflow runner. I need the below functionality in my use case.

- Read input events from Kafka topic(s). Each Kafka message value derives a `[userID, Event]` pair.
- For each `userID`, I need to maintain a `profile`, and the current `Event` may trigger an update to that `profile`. If the `profile` is updated:
  - The updated `profile` is written to the output stream.
  - The next `Event` for that `userID` in the pipeline should refer to the updated `profile`.
I was thinking of using the state functionality provided in Beam, without depending on an external key-value store for maintaining the user profile. Is this feasible with the current version of Beam (2.1.0) and the Dataflow runner? If I understand correctly, state is scoped to the elements in a single window firing (i.e., even for a GlobalWindow, the state will be scoped to the elements in a single firing of the window caused by a trigger). Am I missing something here?
Answer 1:
State would be perfectly appropriate for your use case.
The only correction is that state is scoped to a single window (per key and window), but trigger firings do not affect it. So, if your state is small, you can store it in the global window. When a new element arrives, you can read the state, output elements as needed, and write any changes back to the state.
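For illustration, here is a minimal sketch of such a stateful `DoFn` in the Java SDK, applied to a keyed `PCollection<KV<String, Event>>`. The `Profile` and `Event` types and the `applyEvent` merge logic are hypothetical placeholders; the state API itself (`StateSpec`, `StateSpecs`, `ValueState`) is Beam's own.

```java
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.ValueState;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

// Profile, Event and applyEvent(...) are hypothetical application types/logic.
class UpdateProfileFn extends DoFn<KV<String, Event>, KV<String, Profile>> {

  // One state cell per key and window; in the global window this is
  // effectively per key. Coder is inferred; pass one explicitly if needed.
  @StateId("profile")
  private final StateSpec<ValueState<Profile>> profileSpec = StateSpecs.value();

  @ProcessElement
  public void processElement(
      ProcessContext c,
      @StateId("profile") ValueState<Profile> profileState) {
    Profile current = profileState.read();  // null the first time a key is seen
    Profile updated = applyEvent(current, c.element().getValue());
    if (updated != null && !updated.equals(current)) {
      profileState.write(updated);  // the next Event for this key sees the update
      c.output(KV.of(c.element().getKey(), updated));  // emit only on change
    }
  }

  private static Profile applyEvent(Profile current, Event event) {
    // Domain-specific update logic; returns the (possibly unchanged) profile.
    return current;  // placeholder
  }
}
```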
The only thing to consider is how large the total state may become if you have an unbounded number of user IDs. For instance, you may want an inactivity timer that clears a user's state after some period of time.
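As a sketch of that cleanup, the `DoFn` above could be extended with a processing-time timer that is reset on every event for the key and clears the state when it fires. The seven-day duration is an arbitrary placeholder; only the timer API (`TimerSpecs`, `@TimerId`, `@OnTimer`) is Beam's own.

```java
import org.apache.beam.sdk.state.TimeDomain;
import org.apache.beam.sdk.state.Timer;
import org.apache.beam.sdk.state.TimerSpec;
import org.apache.beam.sdk.state.TimerSpecs;
import org.joda.time.Duration;

  // Additions inside UpdateProfileFn:

  @TimerId("expiry")
  private final TimerSpec expirySpec = TimerSpecs.timer(TimeDomain.PROCESSING_TIME);

  @ProcessElement
  public void processElement(
      ProcessContext c,
      @StateId("profile") ValueState<Profile> profileState,
      @TimerId("expiry") Timer expiryTimer) {
    // Push the expiry out every time this user is active.
    expiryTimer.offset(Duration.standardDays(7)).setRelative();
    // ... same read/update/write logic as above ...
  }

  @OnTimer("expiry")
  public void onExpiry(@StateId("profile") ValueState<Profile> profileState) {
    profileState.clear();  // drop the profile for an inactive user
  }
```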
If you haven't read them, the blog posts *Stateful Processing with Apache Beam* and *Timely (and Stateful) Processing with Apache Beam* provide a good introduction to these concepts and APIs.
Source: https://stackoverflow.com/questions/47206265/continuous-state-in-apache-beam-pipeline