How to do a cartesian product of two PCollections in Dataflow?

Submitted by 自古美人都是妖i on 2019-12-23 15:58:23

Question


I would like to do a cartesian product of two PCollections. Neither PCollection fits into memory, so using a side input is not feasible.

My goal is this: I have two datasets. One has many small elements. The other has few (~10) very large elements. I would like to take the product of these two datasets and then produce key-value objects.


Answer 1:


I think CoGroupByKey might work in your situation:

https://cloud.google.com/dataflow/model/group-by-key#join

That's what I did for a similar use case, though mine was probably not constrained by memory (have you tried a larger cluster with bigger machines?):

  PCollection<KV<String, TableRow>> inputClassifiedKeyed = inputClassified
            .apply(ParDo.named("Actuals : Keys").of(new ActualsRowToKeyedRow()));

  PCollection<KV<String, Iterable<Map<String, String>>>> groupedCategories = p
  [...]
   .apply(GroupByKey.create());

So the collections are keyed by the same key.

Then I declared the Tags:

 final TupleTag<Iterable<Map<String, String>>> categoryTag = new TupleTag<>();
 final TupleTag<TableRow> actualsTag = new TupleTag<>();

Combined them:

 PCollection<KV<String, CoGbkResult>> actualCategoriesCombined =
            KeyedPCollectionTuple.of(actualsTag, inputClassifiedKeyed)
                    .and(categoryTag, groupedCategories)
                    .apply(CoGroupByKey.create());
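
To make the shape of the co-grouped result concrete, here is a minimal plain-Java sketch of the idea. It does not use the Dataflow or Beam API: `CoGroupSketch`, `Grouped` and `coGroup` are hypothetical names made up for illustration, and `Grouped` is only a rough stand-in for what a `CoGbkResult` holds per key (the values from each tagged input).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CoGroupSketch {

    /** Hypothetical stand-in for a per-key CoGbkResult: the values from each input. */
    public static final class Grouped<L, R> {
        public final List<L> left = new ArrayList<>();
        public final List<R> right = new ArrayList<>();
    }

    /** Collects the values of both keyed inputs under their shared keys. */
    public static <K, L, R> Map<K, Grouped<L, R>> coGroup(
            List<Map.Entry<K, L>> left, List<Map.Entry<K, R>> right) {
        Map<K, Grouped<L, R>> result = new LinkedHashMap<>();
        for (Map.Entry<K, L> e : left) {
            result.computeIfAbsent(e.getKey(), k -> new Grouped<>()).left.add(e.getValue());
        }
        for (Map.Entry<K, R> e : right) {
            result.computeIfAbsent(e.getKey(), k -> new Grouped<>()).right.add(e.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> actuals = List.of(
                Map.entry("k1", "row1"), Map.entry("k1", "row2"), Map.entry("k2", "row3"));
        List<Map.Entry<String, String>> categories = List.of(Map.entry("k1", "catA"));

        // Keys present on only one side still appear, with an empty list for the other side.
        coGroup(actuals, categories).forEach((key, g) ->
                System.out.println(key + " -> " + g.left + " / " + g.right));
    }
}
```

A key that appears in only one input still gets an entry, which is why the formatting step has to check whether the other side is empty before reading from it.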

In my case, the final step was reformatting the results from the tagged groups, continuing the same pipeline:

   actualCategoriesCombined
            .apply(
                    ParDo.named("Actuals : Formatting")
                            .of(
                                    new DoFn<KV<String, CoGbkResult>, TableRow>() {
                                        @Override
                                        public void processElement(ProcessContext c) throws Exception {

                                            KV<String, CoGbkResult> e = c.element();

                                            Iterable<TableRow> actualTableRows = e.getValue().getAll(actualsTag);
                                            Iterable<Iterable<Map<String, String>>> categoriesAll = e.getValue().getAll(categoryTag);

                                            for (TableRow row : actualTableRows) {

                                                // Some of the actuals do not have categories
                                                if (categoriesAll.iterator().hasNext()) {
                                                    row.put("advertiser", categoriesAll.iterator().next());
                                                }

                                                c.output(row);
                                            }
                                        }
                                    }
                            )
            );

Hope this helps. Again, I am not sure about the in-memory constraints. Please share the results if you try this approach.




Answer 2:


To create a cartesian product, use the Apache Beam Join extension:

import org.apache.beam.sdk.extensions.joinlibrary.Join;

...

// Use Join.fullOuterJoin(final PCollection<KV<K, V1>> leftCollection, final PCollection<KV<K, V2>> rightCollection, final V1 leftNullValue, final V2 rightNullValue)
// with the same key for all rows to create the cartesian product, as shown below:

    public static void process(Pipeline pipeline, DataInputOptions options) {
        PCollection<KV<Integer, CpuItem>> cpuList = pipeline
                .apply("ReadCPUs", TextIO.read().from(options.getInputCpuFile()))
                .apply("Creating Cpu Objects", new CpuItem()).apply("Preprocess Cpu",
                        MapElements
                                .into(TypeDescriptors.kvs(TypeDescriptors.integers(), TypeDescriptor.of(CpuItem.class)))
                                .via((CpuItem e) -> KV.of(0, e)));

        PCollection<KV<Integer, GpuItem>> gpuList = pipeline
                .apply("ReadGPUs", TextIO.read().from(options.getInputGpuFile()))
                .apply("Creating Gpu Objects", new GpuItem()).apply("Preprocess Gpu",
                        MapElements
                                .into(TypeDescriptors.kvs(TypeDescriptors.integers(), TypeDescriptor.of(GpuItem.class)))
                                .via((GpuItem e) -> KV.of(0, e)));

        PCollection<KV<Integer,KV<CpuItem,GpuItem>>>  cartesianProduct = Join.fullOuterJoin(cpuList, gpuList, new CpuItem(), new GpuItem());
        PCollection<String> finalResultCollection = cartesianProduct.apply("Format results", MapElements.into(TypeDescriptors.strings())
                .via((KV<Integer, KV<CpuItem,GpuItem>> e) -> e.getValue().toString()));
        finalResultCollection.apply("Output the results",
                TextIO.write().to("fps.batchproc\\parsed_cpus").withSuffix(".log"));
        pipeline.run();
    }

In the code above, this line

...
        .via((CpuItem e) -> KV.of(0, e)));
...

creates a KV with a key of 0 for every row in the input data. As a result, every row on one side matches every row on the other. This is equivalent to a SQL JOIN without a WHERE clause, that is, a join with no join condition.
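
As a rough illustration of why a constant key turns a key-based join into a cartesian product, here is a plain-Java model of the idea. It is only a sketch and does not use the Beam API: `ConstantKeyJoinSketch`, `joinOnKey` and `cartesian` are hypothetical names, `Map.Entry` stands in for `KV`, and the join is modelled as an inner join, on the assumption that with one shared key and two non-empty inputs a full outer join produces the same pairs.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ConstantKeyJoinSketch {

    /** Inner join on key: pairs each left value with every right value under the same key. */
    public static <K, L, R> List<Map.Entry<L, R>> joinOnKey(
            List<Map.Entry<K, L>> left, List<Map.Entry<K, R>> right) {
        Map<K, List<R>> rightByKey = new HashMap<>();
        for (Map.Entry<K, R> e : right) {
            rightByKey.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
        }
        List<Map.Entry<L, R>> out = new ArrayList<>();
        for (Map.Entry<K, L> l : left) {
            for (R r : rightByKey.getOrDefault(l.getKey(), List.of())) {
                out.add(Map.entry(l.getValue(), r));
            }
        }
        return out;
    }

    /** Tags every element with the key 0 and joins, so every left value meets every right value. */
    public static <L, R> List<Map.Entry<L, R>> cartesian(List<L> left, List<R> right) {
        List<Map.Entry<Integer, L>> keyedLeft = new ArrayList<>();
        for (L v : left) {
            keyedLeft.add(Map.entry(0, v));
        }
        List<Map.Entry<Integer, R>> keyedRight = new ArrayList<>();
        for (R v : right) {
            keyedRight.add(Map.entry(0, v));
        }
        return joinOnKey(keyedLeft, keyedRight);
    }

    public static void main(String[] args) {
        // Hypothetical sample values, standing in for the CPU and GPU items.
        List<Map.Entry<String, String>> pairs =
                cartesian(List.of("cpu1", "cpu2"), List.of("gpuA", "gpuB", "gpuC"));
        pairs.forEach(p -> System.out.println(p.getKey() + " x " + p.getValue()));
    }
}
```

Because every element shares one key, all of the pairing happens within a single key group. In a distributed runner this probably means the work for that key is not spread across workers, so the approach may not suit inputs that are too large for one machine.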



Source: https://stackoverflow.com/questions/41050477/how-to-do-a-cartesian-product-of-two-pcollections-in-dataflow
