Question
From: https://cloud.google.com/dataflow/service/dataflow-service-desc#preventing-fusion
You can insert a GroupByKey and ungroup after your first ParDo. The Dataflow service never fuses ParDo operations across an aggregation.
This is what I came up with in Python. Is this reasonable, or is there a simpler way?
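import apache_beam as beam

def prevent_fuse(collection):
    return (
        collection
        # Key each element by itself so GroupByKey has something to group on.
        | beam.Map(lambda x: (x, 1))
        | beam.GroupByKey()
        # Emit the key once per grouped value to restore the original elements.
        | beam.FlatMap(lambda x: (x[0] for v in x[1]))
    )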
EDIT, in response to Ben Chambers' question
We want to prevent fusion because we have a collection which generates a much larger collection, and we need parallelization across the larger collection. If it fuses, I only get one worker across the larger collection.
Answer 1:
Apache Beam SDK 2.3.0 adds the experimental Reshuffle transform, which is the Python alternative to the Reshuffle.viaRandomKey operation mentioned by @BenChambers. You can use it in place of your custom prevent_fuse code.
Answer 2:
That should work. There are other ways, but they partly depend on what you are trying to do and why you want to prevent fusion. Keep in mind that fusion is an important optimization to improve the performance of your pipeline.
Could you elaborate on why you want to prevent fusion?
Answer 3:
A small adjustment to my original proposal: if each item is too large to use as a key, that approach will fail. The items still have to be spread across multiple keys, so a constant key doesn't work either. Instead, you can supply a key function, which needs to differentiate the objects and return something small, like a hash.
That said, I'm still not sure this is the best way, or whether something simpler (beam.Partition?) would work. It would also be good for Beam to supply an explicit primitive.
def prevent_fuse(collection, key=None):
    """
    Prevent a Dataflow PCollection from fusing with the next PCollection.
    Supply a key function if the items are too big to use as keys.
    """
    key = key or (lambda x: x)
    return (
        collection
        | beam.Map(lambda v: (key(v), v))
        | beam.GroupByKey()
        | beam.FlatMap(lambda kv: kv[1])
    )
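One possible key function of the "small, like a hash" kind is sketched below. The name stable_key and the buckets parameter are hypothetical, and the sketch assumes each item has a repr() that distinguishes it from other items. It uses hashlib rather than the built-in hash(), since hash() of a string is salted per process by default.

```python
import hashlib


def stable_key(item, buckets=1024):
    """Map an item to a small integer key in range(buckets)."""
    # Assumption: repr(item) differs between items that should be kept apart.
    digest = hashlib.sha1(repr(item).encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an integer, then fold into buckets.
    return int.from_bytes(digest[:8], "big") % buckets
```

It could then be passed as prevent_fuse(collection, key=stable_key). Items that land in the same bucket are grouped together, so buckets should be comfortably larger than the number of workers you hope to use.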
Source: https://stackoverflow.com/questions/47162365/best-way-to-prevent-fusion-in-google-dataflow