Watching for new files matching a filepattern in Apache Beam

我在风中等你 2020-11-30 13:05

I have a directory on GCS or another supported filesystem to which new files are being written by an external process.

I would like to write an Apache Beam streaming

2 answers
  •  轻奢々 (original poster)
    2020-11-30 13:53

    This is possible starting with Apache Beam 2.2.0. Several APIs support this use case:

    If you're using TextIO or AvroIO, they support this explicitly via TextIO.read().watchForNewFiles() and the same on readAll(), for example:

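    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.transforms.Watch;
    import org.apache.beam.sdk.values.PCollection;
    import org.joda.time.Duration;

    // p is an existing Pipeline
    PCollection<String> lines = p.apply(TextIO.read()
        .from("gs://path/to/files/*")
        .watchForNewFiles(
            // Check for new files every 30 seconds
            Duration.standardSeconds(30),
            // Never stop checking for new files
            Watch.Growth.never()));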
    

    If you're using a different file format, you may use FileIO.match().continuously() and FileIO.matchAll().continuously() which support the same API, in combination with FileIO.readMatches().
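
    For a format without built-in support, the combination above might look roughly like the following sketch. The class name, method name, and DoFn are placeholders introduced for illustration, and the DoFn simply reads each new file whole as UTF-8 text; a real pipeline would substitute its own parsing.

```java
import java.io.IOException;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.Watch;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

// Hypothetical class name, used only for this sketch.
public class WatchFilesSketch {

  /** Emits (file name, file contents) for every matched file. */
  static class ReadWholeFileFn extends DoFn<FileIO.ReadableFile, KV<String, String>> {
    @ProcessElement
    public void processElement(ProcessContext c) throws IOException {
      FileIO.ReadableFile file = c.element();
      String name = file.getMetadata().resourceId().toString();
      // Reads the whole file into memory; suitable only for small files.
      c.output(KV.of(name, file.readFullyAsUTF8String()));
    }
  }

  /** Continuously matches the filepattern and reads each newly seen file. */
  static PCollection<KV<String, String>> readNewFiles(Pipeline p, String filepattern) {
    return p
        .apply(FileIO.match()
            .filepattern(filepattern)
            // Same arguments as watchForNewFiles(): poll interval, then termination condition
            .continuously(Duration.standardSeconds(30), Watch.Growth.never()))
        .apply(FileIO.readMatches())
        .apply(ParDo.of(new ReadWholeFileFn()));
  }
}
```

    Calling readNewFiles(p, "gs://path/to/files/*") would then yield one element per new file, which downstream transforms can parse in whatever format the files use.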

    The APIs support specifying how often to check for new files, and when to stop checking (supported conditions are e.g. "if no new output appears within a given time", "after observing N outputs", "after a given time since starting to check" and their combinations).
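
    As a rough illustration of combining stop conditions, the sketch below builds two termination conditions from the Watch.Growth helpers. It covers only the time-based conditions; the class and method names and the chosen durations are assumptions made for the example.

```java
import org.apache.beam.sdk.transforms.Watch;
import org.apache.beam.sdk.transforms.Watch.Growth.TerminationCondition;
import org.joda.time.Duration;

// Hypothetical class name, used only for this sketch.
public class TerminationSketch {

  /** Stop if no new file appears for an hour, or after a day in total, whichever comes first. */
  static TerminationCondition<String, ?> idleOrDeadline() {
    return Watch.Growth.eitherOf(
        Watch.Growth.<String>afterTimeSinceNewOutput(Duration.standardHours(1)),
        Watch.Growth.<String>afterTotalOf(Duration.standardDays(1)));
  }

  /** Stop only once both hold: ten idle minutes and at least an hour since starting to check. */
  static TerminationCondition<String, ?> idleAndMinimumRuntime() {
    return Watch.Growth.allOf(
        Watch.Growth.<String>afterTimeSinceNewOutput(Duration.standardMinutes(10)),
        Watch.Growth.<String>afterTotalOf(Duration.standardHours(1)));
  }
}
```

    Either result can be passed as the second argument of watchForNewFiles() or continuously() in place of Watch.Growth.never().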

    Note that this feature currently works only in the Direct runner and the Dataflow runner, and only in the Java SDK. In general, it will work in any runner that supports Splittable DoFn (see capability matrix).
