Computing GroupBy once then passing it to multiple transformations in Google DataFlow (Python SDK)
Question: I am using the Python SDK for Apache Beam to run a feature extraction pipeline on Google Dataflow. I need to run multiple transformations, all of which expect items to be grouped by key. Based on the answer to this question, Dataflow is unable to automatically detect and reuse repeated transformations like GroupByKey, so I hoped to run GroupByKey first and then feed the resulting PCollection to the other transformations (see sample code below). I wonder whether this is supposed to work efficiently in Dataflow. If not, what is the recommended workaround?