How can I improve performance of TextIO or AvroIO when reading a very large number of files?

遥遥无期 2020-12-19 10:29

TextIO.read() and AvroIO.read() (as well as some other Beam IOs) by default don't perform very well in current Apache Beam runners when reading a very large number of files.

1 Answer
  • 2020-12-19 11:12

    When you know in advance that the filepattern being read with TextIO or AvroIO will expand into a large number of files, you can use the recently added hint .withHintMatchesManyFiles(), which is currently implemented on TextIO and AvroIO.

    For example:

    PCollection<String> lines = p.apply(TextIO.read()
        .from("gs://some-bucket/many/files/*")
        .withHintMatchesManyFiles());
    

    Using this hint causes the transforms to execute in a way optimized for reading a large number of files: the number of files that can be read in this case is practically unlimited, and the pipeline will most likely run faster, more cheaply, and more reliably than without the hint.

    However, it may perform worse than without the hint if the filepattern actually matches only a small number of files (for example, a few dozen or a few hundred files).

    Under the hood, this hint causes the transforms to execute via TextIO.readAll() or AvroIO.readAll(), respectively. These are more flexible and scalable versions of read() that read a PCollection<String> of filepatterns (where each String is a filepattern). The same caveat applies: if the total number of files matching the filepatterns is small, they may perform worse than a simple read() with the filepattern specified at pipeline construction time.
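    For completeness, here is a sketch (not from the original answer) of calling TextIO.readAll() directly on a PCollection of filepatterns; the bucket paths are placeholders, and in a real pipeline the filepatterns could just as well be produced by an upstream transform:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.transforms.Create;
    import org.apache.beam.sdk.values.PCollection;

    Pipeline p = Pipeline.create();

    // Each element is a filepattern; readAll() expands and reads all of them.
    PCollection<String> filepatterns = p.apply(
        Create.of("gs://some-bucket/many/files/*",
                  "gs://some-bucket/other/files/*"));

    PCollection<String> lines = filepatterns.apply(TextIO.readAll());

    This is useful when the set of filepatterns is itself computed at pipeline runtime rather than known at construction time.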
