How to get the filename when using a file pattern match in google-cloud-dataflow

无人共我 2020-12-06 14:10

Does anyone know how to get the filename when using a file pattern match in google-cloud-dataflow?

I'm new to Dataflow. How do I get the filename when using a file pattern match,

5 answers
  •  感动是毒
    2020-12-06 14:43

    I also ended up with one node per input file on the Dataflow diagram (100 input files = 100 nodes) when using code similar to @danvk's. I switched to the approach below, which combines all the reads into a single block that you can expand to drill down into each file or directory that was read. In our use case the job also ran faster this way than with the Lists.transform approach.

    // Expand the input file pattern into the concrete GCS paths it matches.
    GcsOptions gcsOptions = options.as(GcsOptions.class);
    List<GcsPath> paths = gcsOptions.getGcsUtil().expand(GcsPath.fromUri(options.getInputFile()));
    List<String> filesToProcess = paths.stream().map(item -> item.toString()).collect(Collectors.toList());
    
    // One read per file, so each MyDoFn instance knows which file its elements came from.
    // MyOutput is a placeholder for whatever type MyDoFn emits.
    PCollectionList<MyOutput> pcl = PCollectionList.empty(p);
    for (String fileName : filesToProcess) {
        pcl = pcl.and(
                p.apply("ReadAvroFile" + fileName, AvroIO.Read.named("ReadFromAvro")
                        .from(fileName)
                        .withSchema(SomeClass.class)
                )
                .apply(ParDo.of(new MyDoFn(fileName)))
        );
    }
    
    // flatten the PCollectionList, combining all the PCollections together
    PCollection<MyOutput> flattenedPCollection = pcl.apply(Flatten.pCollections());
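
The pattern above does not depend on Dataflow itself: expand the file pattern yourself, read each matched file in its own step so the filename is known, tag every record with that name, then flatten the results. Here is a rough plain-Java sketch of the same idea against the local filesystem, using only java.nio. The names `FilenameTagging`, `expand`, and `readWithFilenames` are made up for illustration and are not part of any Dataflow SDK.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not a Dataflow class: a local analogue of
// "expand the pattern, read per file, tag with filename, flatten".
class FilenameTagging {

    // Expand a glob such as "*.txt" inside dir into a sorted list of matching files.
    static List<Path> expand(Path dir, String glob) throws IOException {
        List<Path> matches = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, glob)) {
            for (Path p : stream) {
                matches.add(p);
            }
        }
        Collections.sort(matches);
        return matches;
    }

    // Read each matched file separately, pair every line with the name of the
    // file it came from, and append all pairs to one list (the "flatten" step).
    static List<Map.Entry<String, String>> readWithFilenames(Path dir, String glob)
            throws IOException {
        List<Map.Entry<String, String>> out = new ArrayList<>();
        for (Path file : expand(dir, glob)) {
            String name = file.getFileName().toString();
            for (String line : Files.readAllLines(file)) {
                out.add(new SimpleEntry<>(name, line));
            }
        }
        return out;
    }
}
```

In the Dataflow version, the role of `name` is played by the `fileName` passed to the `MyDoFn` constructor, and the final list corresponds to the flattened PCollection.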
    
