Someone know how to get Filename when using file pattern match in google-cloud-dataflow?
I\'m newbee to use dataflow. How to get filename when use file patten match,
If you would like to simply expand the filepattern and get a list of filenames matching it, you can use GcsIoChannelFactory.match("gs://dataflow-samples/shakespeare/*.txt") (see GcsIoChannelFactory).
If you would like to access the "current filename" from inside one of the DoFn's downstream in your pipeline - that is currently not supported (though there are some workarounds - see below). It is a common feature request and we are still thinking how best to fit it into the framework in a natural, generic and high-performant way.
Some workarounds include:
DoFn readFile = ...(takes a filename, reads the file and produces records)...
p.apply(Create.of(filenames))
.apply(ParDo.of(readFile))
.apply(the rest of your pipeline)
This has the downside that dynamic work rebalancing features won't work particularly well, because they currently apply at the level of Read PTransform's only, but not at the level of ParDo's with high fan-out (like the one here, which would read a file and produce all records); and parallelization will only work to the level of files but files will not be split into sub-ranges. At the scale of reading Shakespeare this is not an issue, but if you are reading a set of files of wildly different size, some extremely large, then it may become an issue.
FileBasedSource (javadoc, general documentation) which would return records of type something like Pair where the String is the filename and the T is the record you're reading. In this case the framework would handle the filepattern matching for you, dynamic work rebalancing would work just fine, however it is up to you to write the reading logic in your FileBasedReader.Both of these work-arounds are non-ideal, but depending on your requirements, one of them may do the trick for you.