Dataflow GroupByKey and CoGroupByKey are very slow

Posted by 时光总嘲笑我的痴心妄想 on 2020-05-26 10:16:08

Question


Dataflow works great for pipelines with simple transforms, but when a pipeline includes complex transforms such as joins, the performance is really bad.


Answer 1:


I wrote this question to answer it myself.

What's happening under the hood:

  • The data that Dataflow transfers between PCollections (as serializable objects) may not reside on a single machine. Furthermore, a transform like GroupByKey/CoGroupByKey requires the data to be collected in one place (all values for a given key on the same worker) before the result can be produced.
  • Recently I was working on a huge dataset with 970 columns of information. The PCollection was around 300 GB. In my use case I had to join it with another PCollection (with 5 columns) using CoGroupByKey, and when the data was passed as is, it took hours just to get the data ready before the grouping operation could even start.

Workaround

  • Reduce the row size so that each collection carries only the information required for the join.

For instance, if you have 1 key column in the left collection and need 1 value column from the right collection, pass only those columns to the CoGroupByKey transform. Although this drops the remaining columns from the collections that go through the join, you end up with a lookup collection of key-value pairs holding exactly what you need.
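
As a rough illustration of the projection step, here is a minimal sketch in plain Python rather than actual pipeline code. The field names (`customer_id`, `region`, `unused_a`, `unused_b`) and the helper `to_kv` are hypothetical and not from the original post; in a real pipeline the same per-row function would be applied to each element of the PCollection.

```python
# Hypothetical wide rows from the right-hand collection.
# Only two of the fields are actually needed for the join.
right_rows = [
    {"customer_id": "c1", "region": "EU", "unused_a": 1, "unused_b": "x"},
    {"customer_id": "c2", "region": "US", "unused_a": 2, "unused_b": "y"},
]


def to_kv(row, key_field, value_field):
    """Keep only the join key and the single value needed from a row."""
    return (row[key_field], row[value_field])


# Project every row down to a (key, value) pair, then build the lookup.
lookup = dict(to_kv(r, "customer_id", "region") for r in right_rows)
```

Everything that is not the key or the required value never enters the expensive grouping step, which is the point of the workaround.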

You can then use a DoFn to traverse your original data, reconstruct the key for each element, and fetch the matching value from the Map passed as a side input.
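
The per-element lookup might look roughly like the sketch below. It is plain Python that models the logic a DoFn would run, not the code from my pipeline. The names `make_key`, `enrich`, `customer_id`, `region` and `amount` are hypothetical. It also assumes the lookup map is small enough to fit in memory on each worker, which is what makes a side input practical here.

```python
def make_key(row):
    """Rebuild the join key from a full row (hypothetical single-column key)."""
    return row["customer_id"]


def enrich(row, lookup):
    """Return a copy of the row with the looked-up value attached."""
    enriched = dict(row)  # copy so the input element is not mutated
    enriched["region"] = lookup.get(make_key(row))  # None when there is no match
    return enriched


# Stand-in for the Map that would be passed as a side input.
lookup = {"c1": "EU", "c2": "US"}

# Stand-in for the large left-hand collection.
left_rows = [
    {"customer_id": "c1", "amount": 10},
    {"customer_id": "c3", "amount": 5},
]

joined = [enrich(r, lookup) for r in left_rows]

# In the Beam Python SDK this could be wired up along these lines
# (an untested assumption, shown only to indicate the shape):
#   joined = left | beam.Map(enrich, lookup=beam.pvalue.AsDict(lookup_kvs))
```

Rows without a match get `None` here, which behaves like a left outer join; filtering those rows out instead would give inner-join semantics.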

This approach gave amazing results: it enabled me to join the data mentioned above within 5 minutes.



Source: https://stackoverflow.com/questions/52229309/dataflow-groupbykey-and-cogroupbykey-is-very-slow
