A simple counting step following a group by key is extremely slow in a DataFlow pipeline

若如初见. Submitted on 2019-12-05 03:55:06

Question


I have a Dataflow pipeline that builds an index (key-value pairs) and computes some metrics (such as the number of values per key). The input data is about 60 GB in total, stored on GCS, and the pipeline has about 126 workers allocated. According to Stackdriver, all workers have about 6% CPU utilization.

The pipeline seems to make no progress despite having 126 workers, and based on the wall time the bottleneck seems to be a simple counting step that follows a group by. While all other steps have on average less than 1 hour spent in them, the counting step has already taken 50 days of wall time. There seems to be no helpful information or warnings in the log.

The counting step was implemented following a corresponding step in the WordCount example:

def count_keywords_per_product(self, key_and_group):
    key, group = key_and_group
    count = 0
    for e in group:
        count += 1

    self.stats.product_counter.inc()
    self.stats.keywords_per_product_dist.update(count)

    return (key, count)

The preceding step "Group keywords" is a simple beam.GroupByKey() transformation.

Please advise what might be the reason and how this can be optimized.

Current resource metrics:
Current vCPUs                    126
Total vCPU time                  1,753.649 vCPU hr
Current memory                   472.5 GB
Total memory time                6,576.186 GB hr
Current PD                       3.08 TB
Total PD time                    43,841.241 GB hr
Current SSD PD                   0 B
Total SSD PD time                0 GB hr
Total Shuffle data processed     1.03 TB
Billable Shuffle data processed  529.1 GB

The pipeline steps including the counting one can be seen below:


Answer 1:


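The best way to get a count per key here is to use a combine operation. A combiner can pre-aggregate the values for each key on the workers before the shuffle, which alleviates the problem of hot keys (keys with far more values than the rest). With a plain GroupByKey, every value for a given key has to be delivered to a single worker, which then iterates over all of them.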
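To illustrate why a combiner helps with hot keys, here is a minimal pure-Python sketch of the idea, not of Beam's actual implementation. The function names `partial_counts` and `merge_counts` and the sample bundles are made up for the illustration. Each bundle is reduced to one small counter, and only those counters need to be merged, so a hot key contributes one number per bundle instead of all of its values.

```python
from collections import Counter


def partial_counts(bundle):
    """Pre-aggregate one bundle of (key, value) pairs into per-key counts."""
    return Counter(key for key, _ in bundle)


def merge_counts(partials):
    """Merge the small per-bundle counters into the final per-key totals."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts key by key
    return dict(total)


# Hypothetical bundles: "hot" dominates the data, yet each bundle only
# forwards a single number for it.
bundles = [
    [("hot", "a"), ("hot", "b"), ("cold", "c")],
    [("hot", "d"), ("hot", "e")],
]
totals = merge_counts(partial_counts(bundle) for bundle in bundles)
```

In the grouped version, by contrast, all four values for "hot" would have to reach one place before the loop could even start counting.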

Try replacing your GroupByKey + ParDo with beam.combiners.Count.PerKey, or a similar combine transform that suits your use case.



Source: https://stackoverflow.com/questions/55401268/a-simple-counting-step-following-a-group-by-key-is-extremely-slow-in-a-dataflow
