Can I force a step in my Dataflow pipeline to be single-threaded (and on a single machine)?
Question: I have a pipeline that takes URLs of files, downloads them, and generates a BigQuery table row for each line apart from the header. To avoid duplicate downloads, I want to check each URL against a table of previously downloaded ones, and only go ahead and store the URL if it is not already in this "history" table. For this to work, I need to either store the history in a database that enforces unique values, or, perhaps more easily, use BigQuery for this as well, but then access to the table must be