Question
Dataflow works great for pipelines with simple transforms, but when we use complex transforms such as joins, performance degrades badly.
Answer 1:
I wrote this question to answer it myself.
What's happening under the hood:
- The data transferred by Dataflow between PCollections (serializable objects) may not reside on a single machine. Furthermore, a transform like GroupByKey/CoGroupByKey requires all the data for a key to be collected in one place before the result can be populated.
- Recently I was working on a huge dataset with 970 columns of information (a PCollection of around 300 GB). In my use case I had to join this with another PCollection (one with 5 columns) using CoGroupByKey, and when the data was passed as is, it took hours just to get the data ready before the grouping operation could even run.
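For context, CoGroupByKey logically produces, for each key, the matched rows from both collections, which forces every full row to be shuffled to the worker that owns its key. Below is a plain-Python sketch of those semantics (a simulation for illustration, not actual Beam code; all names are made up here):

```python
from collections import defaultdict

def co_group_by_key(left, right):
    """Simulate CoGroupByKey over two lists of (key, value) pairs.

    Every value must travel to wherever its key is grouped, which is
    why shuffling 970-column rows is so expensive.
    """
    grouped = defaultdict(lambda: ([], []))
    for k, v in left:
        grouped[k][0].append(v)   # values from the left collection
    for k, v in right:
        grouped[k][1].append(v)   # values from the right collection
    return dict(grouped)

big = [("id1", {"col1": "...", "col970": "..."}), ("id2", {"col1": "..."})]
small = [("id1", {"region": "EU"})]
result = co_group_by_key(big, small)
# result["id1"] pairs the wide row(s) with the matching lookup row(s)
```

The sketch makes the cost visible: the size of what is shuffled is proportional to the full row width, not just the columns you actually need for the join.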
Workaround
- Reduce the row size so that you pass only the information required to join the collections.
For instance, if you have one key column in the left collection and need one value column from the right collection, pass only those columns to the CoGroupByKey transform. While this discards the rest of your original collection, you will have created a lookup collection holding KV-based information of exactly what you need.
You can then use a DoFn to traverse your original data, reconstruct the key, and fetch the value from the Map passed as a side input.
This approach gives amazing results; it enabled me to join the data mentioned above within 5 minutes.
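The workaround can be sketched in plain Python as follows (a minimal simulation of the pattern, not actual Beam code; in the Beam Python SDK the lookup map would typically be passed to the DoFn as a side input, e.g. via `beam.pvalue.AsDict`, and the function and column names below are illustrative):

```python
# Sketch of the workaround: shrink the right collection down to a
# key -> value lookup map, then enrich the wide rows in a DoFn-style
# pass that reads the map as a side input (simulated here with a dict).

def build_lookup(right_rows, key_col, value_col):
    """Keep only the join key and the single value column we need."""
    return {row[key_col]: row[value_col] for row in right_rows}

def enrich(big_rows, lookup, key_col, new_col):
    """DoFn-style traversal: reconstruct the key, fetch from the map."""
    for row in big_rows:
        out = dict(row)
        out[new_col] = lookup.get(row[key_col])  # None when no match
        yield out

right = [{"id": "u1", "region": "EU"}, {"id": "u2", "region": "US"}]
big = [{"id": "u1", "a": 1, "b": 2}, {"id": "u3", "a": 5, "b": 6}]

lookup = build_lookup(right, "id", "region")
joined = list(enrich(big, lookup, "id", "region"))
# joined[0] == {"id": "u1", "a": 1, "b": 2, "region": "EU"}
```

Only the tiny `lookup` map is broadcast to workers, while the 300 GB collection is processed element by element with no shuffle, which is where the speedup comes from.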
Source: https://stackoverflow.com/questions/52229309/dataflow-groupbykey-and-cogroupbykey-is-very-slow