How to Unnest the nested PCollection in Dataflow

Posted by 烂漫一生 on 2019-12-24 21:23:41

Question


To join two PCollections that have nested structures, we need to unnest them before doing the join, because joining them directly runs into problems (see my other Stack Overflow question, link). So I want to know how to unnest a PCollection. It would be good if someone could suggest either how to join two nested tables or how to unnest PCollections.

I just noticed that there is a PTransform "Unnest" (link) for producing an unnested collection from a nested one, but I could not find any sample on the net. I tried to implement it with the steps below to convert the nested collection, but I am still unable to get the unnested collection at the end.

1) PCollection empCollection = ReadCollection();

2) Use a ParDo function to convert the values from PCollection (com.google.api.services.bigquery.model.TableRow) to PCollection (org.apache.beam.sdk.values.Row).

3) Define the schemas as below:

Schema projects = Schema.builder().addInt32Field("Id").addStringField("Name").build();
Schema Employees = Schema.builder().addStringField("empNo").addStringField("empName").addArrayField("Projects", FieldType.row(projects)).build();

4) Use the Unnest transform to unnest the nested collection:

PCollection<Row> pcColl = targetRowCollection.apply(
    Unnest.<Row>create().withFieldNameFunction(
        new SerializableFunction<java.util.List<java.lang.String>, java.lang.String>() {
            @Override
            public java.lang.String apply(java.util.List<java.lang.String> input) {
                return String.join("+", input);
            }
        }));

5) Use a ParDo function to convert the values from PCollection (org.apache.beam.sdk.values.Row) back to PCollection (com.google.api.services.bigquery.model.TableRow).

Could someone help me use this Unnest transform to get the unnested collection from the nested collection?


Answer 1:


Code for joining two PCollections with nested structure in Python with Beam:

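import apache_beam as beam


class addkeysnested(beam.DoFn):
    """Emit (key, element), where key is read from a dotted path such as "a.b.c"."""

    def process(self, element, fieldName):
        tmp_record = element
        parts = [part.strip() for part in fieldName.split(".")]
        # Every intermediate field is treated as a repeated (list) field,
        # and only its first entry is followed.
        for part in parts[:-1]:
            tmp_record = tmp_record[part][0]
        # The last path component holds the join key itself.
        return [(tmp_record[parts[-1]], element)]


# `option`, `input_file1`/`input_file2` and `join_field_names1`/`join_field_names2`
# (dotted paths to the join key) are assumed to be defined elsewhere.
with beam.Pipeline(options=option) as p:

    source_record1 = p | "get data1" >> beam.io.avroio.ReadFromAvro(input_file1)
    source_record2 = p | "get data2" >> beam.io.avroio.ReadFromAvro(input_file2)

    # Convert into <k, v> form (distinct labels, since the same ParDo is applied twice)
    keyed_record1 = source_record1 | "key data1" >> beam.ParDo(addkeysnested(), join_field_names1)
    keyed_record2 = source_record2 | "key data2" >> beam.ParDo(addkeysnested(), join_field_names2)

    # Apply the join operation
    rjoin = ({'File1Info': keyed_record1, 'File2Info': keyed_record2}
             | beam.CoGroupByKey())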

Note: With the code above we can read the key value at any level of the nested fields, e.g. personalInfo.Address.City. After that, apply CoGroupByKey() to join the two PCollections.
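
The key lookup above follows only the first entry ([0]) of each repeated field. If every entry of a repeated field should take part in the join, one option is to unnest that field first, so that each nested entry becomes its own flat record. Below is a minimal, plain-Python sketch of that idea. The helper name `unnest_field`, the `_` separator, and the sample employee record (modelled on the Employees/Projects schema from the question) are illustrative assumptions, not part of Beam.

```python
from typing import Any, Dict, Iterator


def unnest_field(record: Dict[str, Any], field: str, sep: str = "_") -> Iterator[Dict[str, Any]]:
    """Yield one flat dict per entry of the repeated field `field`.

    Hypothetical helper: parent fields are copied into every output record,
    and child fields are renamed to "<field><sep><child name>".
    Records whose repeated field is empty or missing yield nothing.
    """
    parent = {k: v for k, v in record.items() if k != field}
    for child in record.get(field) or []:
        flat = dict(parent)
        for name, value in child.items():
            flat[f"{field}{sep}{name}"] = value
        yield flat


# Sample record shaped like the Employees schema in the question (made-up values).
employee = {
    "empNo": "E1",
    "empName": "Ann",
    "Projects": [{"Id": 1, "Name": "Alpha"}, {"Id": 2, "Name": "Beta"}],
}

rows = list(unnest_field(employee, "Projects"))
for row in rows:
    print(row)
```

In a pipeline, a function like this could be applied with beam.FlatMap(unnest_field, "Projects"), since FlatMap passes extra arguments through to the function. The flat records can then be keyed, for example on "Projects_Id", before CoGroupByKey(). Dropping records with an empty repeated field is a design choice of this sketch; emit the parent on its own if such records must survive the join.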



Source: https://stackoverflow.com/questions/55539922/how-to-unnest-the-nested-pcollection-in-dataflow
