问题
I am using Google DataFlow Java SDK 2.2.0. Use case as follows:
PCollection pEmployees: employees and corresponding department name. may contain up to 10 million elements.
PCollection pDepartments: department name and number of elements to be published per department. will contain few hundred elements.
task: Collect elements from pEmployees as per the department-wise number for all departments from pDepartments. This will be a big collection (up to a few hundred thousand elements or few GBs).
We cannot user Top transform here as it would work one at a time on pEmployee, whereas we have multiple departments and that too, in a PCollection. We can assign a row number to each of the elements from pEmployees, join it with pDepartments and filter the records where row_number > target number from pDepartments. This will require a global ranking.
Question: how can we assign rank/row numbers to the elements in a pcollection?.
回答1:
This is very close to the Sample transform, but not quite, because it applies the same threshold to all keys when used as .perKey(). Generally, Beam currently doesn't support per-key combines with different combine function parameters.
I'd recommend to emulate it by using CoGroupByKey to join pEmployees and pDepartments and obtain tuples (CoGbkResult) containing department name, N = number of elements, and all employees in that department. Then simply iterate through the employees and emit the first N and discard the rest.
来源:https://stackoverflow.com/questions/48284526/ranking-pcollection-elements