Best way to prevent fusion in Google Dataflow?
Question

From https://cloud.google.com/dataflow/service/dataflow-service-desc#preventing-fusion:

"You can insert a GroupByKey and ungroup after your first ParDo. The Dataflow service never fuses ParDo operations across an aggregation."

This is what I came up with in Python. Is this reasonable, or is there a simpler way?

    import apache_beam as beam

    def prevent_fuse(collection):
        # Pair each element with a dummy value, group by the element, then
        # re-emit the element once per grouped value so duplicates survive.
        return (
            collection
            | beam.Map(lambda x: (x, 1))
            | beam.GroupByKey()
            | beam.FlatMap(lambda x: (x[0] for v in x[1]))
        )

EDIT, in response to Ben Chambers'