A way to execute pipeline periodically from bounded source in Apache Beam

Submitted by 家住魔仙堡 on 2019-12-22 17:55:55

Question


I have a pipeline that reads data from a MySQL server and inserts it into Datastore using the Dataflow runner. It works fine as a batch job executed once. The thing is that I want to get new data from the MySQL server into Datastore in near real time, but JdbcIO gives bounded data as its source (it is the result of a query), so my pipeline executes only once.

Do I have to resubmit a Dataflow job every 30 seconds? Or is there a way to make the pipeline repeat automatically without submitting another job?

It is similar to the topic Running periodic Dataflow job, but I cannot find the CountingInput class. I thought it might have been replaced by the GenerateSequence class, but I don't really understand how to use it.

Any help would be welcome!


Answer 1:


This is possible, and there are a couple of ways to go about it. It depends on the structure of your database and whether it allows efficiently finding new elements that appeared since the last sync. E.g., do your elements have an insertion timestamp? Can you afford to have another table in MySQL containing the last timestamp that has been synced to Datastore?

  • You can, indeed, use GenerateSequence.from(0).withRate(1, Duration.standardSeconds(1)) that will give you a PCollection<Long> into which 1 element per second is emitted. You can piggyback on that PCollection with a ParDo (or a more complex chain of transforms) that does the necessary periodic synchronization. You may find JdbcIO.readAll() handy because it can take a PCollection of query parameters and so can be triggered every time a new element in a PCollection appears.

  • If the amount of data in MySQL is not that large (at most, something like hundreds of thousands of records), you can use the Watch.growthOf() transform to continually poll the entire database (using regular JDBC APIs) and emit new elements.
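The first bullet above can be sketched roughly as follows. This is a minimal, hypothetical sketch, not a complete solution: the table name (`orders`), column names (`id`, `payload`, `created_at`), JDBC URL, and the "query the last 30 seconds" parameter-setting strategy are all assumptions for illustration; a real pipeline would track the last-synced timestamp (e.g., in the extra MySQL table mentioned above) rather than deriving a window from the wall clock.

```java
import java.sql.Timestamp;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.GenerateSequence;
import org.apache.beam.sdk.io.jdbc.JdbcIO;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;
import org.joda.time.Instant;

public class PeriodicJdbcSync {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    // Unbounded "tick" source: emits one Long every 30 seconds, forever.
    PCollection<Long> ticks =
        p.apply(GenerateSequence.from(0).withRate(1, Duration.standardSeconds(30)));

    // Each incoming tick triggers one execution of the query via readAll().
    PCollection<String> rows = ticks.apply(
        JdbcIO.<Long, String>readAll()
            .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
                "com.mysql.cj.jdbc.Driver", "jdbc:mysql://host:3306/db"))
            .withQuery("SELECT id, payload FROM orders WHERE created_at > ?")
            .withParameterSetter((tick, stmt) ->
                // Assumed strategy: fetch rows inserted in the last 30 seconds.
                stmt.setTimestamp(1, new Timestamp(
                    Instant.now().minus(Duration.standardSeconds(30)).getMillis())))
            .withRowMapper(rs -> rs.getLong("id") + "," + rs.getString("payload"))
            .withCoder(StringUtf8Coder.of()));

    // From here, write `rows` to Datastore as in the existing batch job.
    p.run();
  }
}
```

Because GenerateSequence with a rate is unbounded, the runner treats this as a streaming pipeline that keeps running, so the query is re-executed periodically without resubmitting the job.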

That said, what Andrew suggested (additionally publishing records to Pub/Sub) is also a very valid approach.




Answer 2:


Do I have to execute the pipeline and resubmit a Dataflow job every 30 seconds?

Yes. For bounded data sources, it is not possible to have the Dataflow job continually read from MySQL. When using the JdbcIO class, a new job must be deployed each time.

Or is there a way to make the pipeline repeat automatically without submitting another job?

A better approach would be to have whatever system is inserting records into MySQL also publish a message to a Pub/Sub topic. Since Pub/Sub is an unbounded data source, Dataflow can continually pull messages from it.
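A minimal sketch of that streaming side, assuming the producer publishes each inserted record as a string message. The project and topic names are placeholders, and the parsing/Datastore-write step is only indicated in a comment:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.values.PCollection;

public class StreamingFromPubsub {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    // Pub/Sub is an unbounded source, so this pipeline runs continuously
    // and receives each record as soon as it is published.
    PCollection<String> messages =
        p.apply(PubsubIO.readStrings()
            .fromTopic("projects/my-project/topics/mysql-inserts"));

    // Parse each message and write the resulting entity to Datastore
    // (e.g., with DatastoreIO.v1().write()), mirroring the batch job.
    p.run();
  }
}
```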



Source: https://stackoverflow.com/questions/48712142/a-way-to-execute-pipeline-periodically-from-bounded-source-in-apache-beam
