How to Get Filename when using file pattern match in google-cloud-dataflow

无人共我 2020-12-06 14:10

Does anyone know how to get the filename when using a file pattern match in google-cloud-dataflow?

I'm new to Dataflow. How do I get the name of the file each record came from when reading with a file pattern match?

5 answers
  •  一向 (original poster)
    2020-12-06 14:47

    One approach is to build a List of PCollections in which each entry corresponds to one input file, then merge them with Flatten. For example, if you want to parse each line of a collection of files into a Foo object, you might do something like this:

    public static class FooParserFn extends DoFn<String, Foo> {
      private final String fileName;

      public FooParserFn(String fileName) {
        this.fileName = fileName;
      }
    
      @Override
      public void processElement(ProcessContext processContext) throws Exception {
        String line = processContext.element();
        // here you have access to both the line of text and the name of the file
        // from which it came.
      }
    }
    
    public static void main(String[] args) {
      ...
      List<String> inputFiles = ...;
      // One PCollection<Foo> per input file. Lists.transform returns a lazy view,
      // so iterate it only once (as below) to avoid applying the transforms twice.
      List<PCollection<Foo>> foosByFile =
              Lists.transform(inputFiles,
              new Function<String, PCollection<Foo>>() {
                @Override
                public PCollection<Foo> apply(String fileName) {
                  return p.apply(TextIO.Read.from(fileName))
                          .apply(ParDo.of(new FooParserFn(fileName)));
                }
              });
    
      // Merge the per-file collections back into a single PCollection.
      PCollection<Foo> foos = PCollectionList.<Foo>empty(p).and(foosByFile).apply(Flatten.<Foo>pCollections());
      ...
    }
    

    One downside of this approach is that, if you have 100 input files, you'll also have 100 nodes in the Cloud Dataflow monitoring console. This makes it hard to tell what's going on. I'd be interested in hearing from the Google Cloud Dataflow people whether this approach is efficient.
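
    The per-file-then-flatten shape described above can be tried without the Dataflow SDK. The following is only a plain-Java sketch of the same idea, not Dataflow code: the names TaggedLines, TaggedLine, readOne and readAll are hypothetical, and in-memory lists stand in for PCollections. readOne plays the role of TextIO.Read plus FooParserFn for a single file, and readAll plays the role of Flatten.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Plain-Java analogue (no Dataflow SDK): one branch per file, then flatten.
class TaggedLines {

  // Hypothetical value type pairing a line with the file it came from.
  static final class TaggedLine {
    final String fileName;
    final String line;

    TaggedLine(String fileName, String line) {
      this.fileName = fileName;
      this.line = line;
    }
  }

  // Reads one file and tags every line with that file's name.
  static List<TaggedLine> readOne(Path file) {
    try {
      List<TaggedLine> out = new ArrayList<>();
      String name = file.getFileName().toString();
      for (String line : Files.readAllLines(file)) {
        out.add(new TaggedLine(name, line));
      }
      return out;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Merges the per-file results into one list, preserving file order.
  static List<TaggedLine> readAll(List<Path> files) {
    List<TaggedLine> all = new ArrayList<>();
    for (Path file : files) {
      all.addAll(readOne(file));
    }
    return all;
  }

  public static void main(String[] args) throws IOException {
    // Create two small sample files in a temporary directory.
    Path dir = Files.createTempDirectory("tagged-lines");
    Path a = Files.write(dir.resolve("a.txt"), List.of("x", "y"));
    Path b = Files.write(dir.resolve("b.txt"), List.of("z"));
    for (TaggedLine t : readAll(List.of(a, b))) {
      System.out.println(t.fileName + ": " + t.line);
    }
  }
}
```

    Because the file name is captured when each per-file branch is built, every record carries it by the time the branches are merged. In the Dataflow version this is also why the pipeline graph grows by one branch per input file.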
