Can I force a step in my dataflow pipeline to be single-threaded (and on a single machine)?

问题

I have a pipeline that takes URLs for files and downloads these generating BigQuery table rows for each line apart from the header.

To avoid duplicate downloads, I want to check URLs against a table of previously downloaded ones and only go ahead and store the URL if it is not already in this "history" table.

For this to work I need to either store the history in a database allowing unique values or it might be easier to use BigQuery for this also, but then access to the table must be strictly serial.

Can I enforce single-thread execution (on a single machine) to satisfy this for part of my pipeline only?

(After this point, each of my 100s of URLs/files would be suitable for processed on a separate thread; each single file gives rise to 10000-10000000 rows, so throttling at that point will almost certainly not give performance issues.)

回答1:

Beam is designed for parallel processing of data and it tries to explicitly stop you from synchronizing or blocking except using a few built-in primitives, such as Combine.

It sounds like what you want is a filter that emits an element (your URL) only the first time it is seen. You can probably use the built-in Distinct transform for this. This operator uses a Combine per-key to group the elements by key (your URL in this case), then emits each key only the first time it is seen.

来源：https://stackoverflow.com/questions/57360621/can-i-force-a-step-in-my-dataflow-pipeline-to-be-single-threaded-and-on-a-singl

标签

java

google-cloud-platform

google-cloud-dataflow

apache-beam

thread-synchronization

易学教程内所有资源均来自网络或用户发布的内容，如有违反法律规定的内容欢迎反馈！
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!