Best way to prevent fusion in Google Dataflow?

Submitted by 别来无恙 on 2020-01-14 12:37:50

Question


From: https://cloud.google.com/dataflow/service/dataflow-service-desc#preventing-fusion

You can insert a GroupByKey and ungroup after your first ParDo. The Dataflow service never fuses ParDo operations across an aggregation.

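This is what I came up with in Python. Is this reasonable, or is there a simpler way?

import apache_beam as beam

def prevent_fuse(collection):
    return (
        collection
        # Pair each element with a dummy value so it can be grouped.
        | beam.Map(lambda x: (x, 1))
        # The GroupByKey is the aggregation that breaks fusion.
        | beam.GroupByKey()
        # Ungroup: emit the key once per grouped value, so duplicates survive.
        | beam.FlatMap(lambda x: (x[0] for v in x[1]))
        )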

EDIT, in response to Ben Chambers' question

We want to prevent fusion because we have a collection which generates a much larger collection, and we need parallelization across the larger collection. If it fuses, I only get one worker across the larger collection.


Answer 1:


Apache Beam SDK 2.3.0 adds the experimental Reshuffle transform, which is the Python alternative to the Reshuffle.viaRandomKey operation mentioned by @BenChambers. You can use it in place of your custom prevent_fuse code.




Answer 2:


That should work. There are other ways, but they partly depend on what you are trying to do and why you want to prevent fusion. Keep in mind that fusion is an important optimization to improve the performance of your pipeline.

Could you elaborate on why you want to prevent fusion?




Answer 3:


A small adjustment to my original proposal: if each item is too large to use as a key, the original version will fail. The items still need to be spread across multiple keys, so using a constant key doesn't work. Instead, you can supply a key function, which needs to differentiate the objects and return something small, like a hash.

That said, I'm still not sure this is the best way, or whether something simpler (beam.Partition?) would work. It would also be good for Beam to supply an explicit primitive.

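import apache_beam as beam

def prevent_fuse(collection, key=None):
    """
    Prevent a Dataflow PCollection from fusing with the next step.
    Supply a key function if the items are too big to use as keys.
    """

    # Default to using each item as its own key.
    key = key or (lambda x: x)

    return (
        collection
        | beam.Map(lambda v: (key(v), v))
        # The GroupByKey is the aggregation that breaks fusion.
        | beam.GroupByKey()
        # Ungroup: emit every value collected under each key.
        | beam.FlatMap(lambda kv: (v for v in kv[1]))
        )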
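
One possible key function for the key parameter above is sketched below. It assumes the items are JSON-serializable (an assumption, not something stated in the question), and small_key is a hypothetical name. It uses hashlib rather than Python's built-in hash(), because hash() on strings is randomized per process by default and so may not give the same key on different workers.

```python
import hashlib
import json


def small_key(item):
    """Return a short, deterministic key for a JSON-serializable item."""
    # sort_keys makes the serialized form independent of dict ordering.
    payload = json.dumps(item, sort_keys=True).encode("utf-8")
    # A SHA-1 hex digest is always 40 characters, however large the item is.
    return hashlib.sha1(payload).hexdigest()
```

It could then be passed as prevent_fuse(collection, key=small_key). Identical items share a key and are grouped together, but the final FlatMap still emits each of them.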


Source: https://stackoverflow.com/questions/47162365/best-way-to-prevent-fusion-in-google-dataflow
