How to get a list of elements out of a PCollection in Google Dataflow and use it in the pipeline to loop Write Transforms?

Posted by 為{幸葍}努か on 2019-12-18 09:09:37

Question


I am using Google Cloud Dataflow with the Python SDK.

I would like to:

  • Get a list of unique dates out of a master PCollection
  • Loop through the dates in that list to create filtered PCollections (each with a unique date), and write each filtered PCollection to its partition in a time-partitioned table in BigQuery.

How can I get that list? After the following combine transform, I end up with a ListPCollectionView object, but I cannot iterate over that object:

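import apache_beam as beam


class ToUniqueList(beam.CombineFn):
    """Combines all input elements into one list of unique values."""

    def create_accumulator(self):
        return []

    def add_input(self, accumulator, element):
        if element not in accumulator:
            accumulator.append(element)
        return accumulator

    def merge_accumulators(self, accumulators):
        # Each accumulator is itself a list (unhashable), so take the union
        # of their elements instead of calling set() on the lists themselves.
        return list(set().union(*accumulators))

    def extract_output(self, accumulator):
        return accumulator


def get_list_of_dates(pcoll):
    # The result is still a PCollection (holding one list), not a Python list.
    return (pcoll
            | 'get the list of dates' >> beam.CombineGlobally(ToUniqueList()))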

Am I doing it all wrong? What is the best way to do that?

Thanks.


Answer 1:


It is not possible to get the contents of a PCollection directly. An Apache Beam or Dataflow pipeline is more like a query plan describing what processing should be done, and a PCollection is a logical intermediate node in that plan rather than a container holding the data. The main program only assembles the plan (the pipeline) and kicks it off.

However, ultimately you're trying to write data to BigQuery tables sharded by date. At the time of this answer, that use case was supported only in the Java SDK and only for streaming pipelines.

For a more general treatment of writing data to multiple destinations depending on the data, follow BEAM-92.

See also Creating/Writing to Parititoned BigQuery table via Google Cloud Dataflow



Source: https://stackoverflow.com/questions/41440634/how-to-get-a-list-of-elements-out-of-a-pcollection-in-google-dataflow-and-use-it
