Does anyone know how to get the filename when using a file pattern match in google-cloud-dataflow?
I'm new to Dataflow. How do I get the filename when using a file pattern match?
I also ended up with 100 input files = 100 nodes on the Dataflow diagram when using code similar to @danvk's. I switched to an approach like this, which combines all the reads into a single block that you can expand to drill down into each file/directory that was read. In our use case, the job also ran faster with this approach than with the Lists.transform approach.
// Expand the input file pattern into the concrete GCS paths it matches.
GcsOptions gcsOptions = options.as(GcsOptions.class);
List<GcsPath> paths = gcsOptions.getGcsUtil().expand(GcsPath.fromUri(options.getInputFile()));
List<String> filesToProcess = paths.stream().map(item -> item.toString()).collect(Collectors.toList());

// Read each file separately so that its name can be handed to the DoFn.
// The element type below assumes MyDoFn emits SomeClass; adjust it to MyDoFn's actual output type.
PCollectionList<SomeClass> pcl = PCollectionList.empty(p);
for (String fileName : filesToProcess) {
    pcl = pcl.and(
        p.apply("ReadAvroFile" + fileName, AvroIO.Read.named("ReadFromAvro")
                .from(fileName)
                .withSchema(SomeClass.class)
        )
        .apply(ParDo.of(new MyDoFn(fileName)))
    );
}
// flatten the PCollectionList, combining all the PCollections together
PCollection<SomeClass> flattenedPCollection = pcl.apply(Flatten.pCollections());
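
To make the data flow easier to follow, here is a rough plain-Java model of what the loop above does. It uses no Dataflow classes, and it is only a sketch under assumptions: `FileNameTagging`, `tagWithFileName`, `tagAndFlatten`, and the file names are all hypothetical, and records are modeled as plain strings. `tagWithFileName` stands in for something like `MyDoFn(fileName)`, assuming the DoFn's job is to attach the filename to each record, and `tagAndFlatten` stands in for the `Flatten.pCollections()` step.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class FileNameTagging {

    // Hypothetical stand-in for MyDoFn(fileName): pairs every record
    // read from one file with the name of that file.
    static List<Map.Entry<String, String>> tagWithFileName(String fileName, List<String> records) {
        List<Map.Entry<String, String>> tagged = new ArrayList<>();
        for (String record : records) {
            tagged.add(new SimpleEntry<>(fileName, record));
        }
        return tagged;
    }

    // Hypothetical stand-in for the per-file loop plus Flatten: processes
    // each file on its own, then merges the per-file results into one list.
    static List<Map.Entry<String, String>> tagAndFlatten(Map<String, List<String>> recordsByFile) {
        List<Map.Entry<String, String>> flattened = new ArrayList<>();
        for (Map.Entry<String, List<String>> file : recordsByFile.entrySet()) {
            flattened.addAll(tagWithFileName(file.getKey(), file.getValue()));
        }
        return flattened;
    }

    public static void main(String[] args) {
        // Made-up file names and records, for illustration only.
        Map<String, List<String>> recordsByFile = new LinkedHashMap<>();
        recordsByFile.put("gs://example-bucket/a.avro", Arrays.asList("r1", "r2"));
        recordsByFile.put("gs://example-bucket/b.avro", Arrays.asList("r3"));

        for (Map.Entry<String, String> e : tagAndFlatten(recordsByFile)) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

The point of the model is that the filename is known only while each file is handled on its own, so it has to be attached to the records before the per-file results are merged.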