Discrepancy in running Apache Beam Data Generator with DirectRunner and FlinkRunner

Submitted by 大憨熊 on 2020-01-06 07:12:42

Question

This question is related to my earlier post about benchmarking Apache Beam with an on-the-fly data generator.

I have the following code to generate data within my pipeline:

PCollection<Long> data = pipeline.apply(GenerateSequence.from(1)
                         .withMaxReadTime(Duration.millis(3000)));

//Print generated data
data.apply(ParDo.of(new DoFn<Long, String>() {
  @ProcessElement
  public void processElement(@Element Long input) {
    System.out.println(input);
  }
}));

pipeline.run();

When I run this code with DirectRunner (--runner=direct), no generated values appear on the console.

When I run the same code with FlinkRunner (--runner=FlinkRunner), the generated data appears in the console output, for example:

4106
4109
4083
...

The second issue is that, even though I set the maximum read time to 3 seconds (Duration.millis(3000)), the generator never stops.
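To make the expectation concrete, here is a minimal plain-Java sketch of the behaviour a 3-second read limit is expected to produce: values start at 1, increase by one, and stop once the time budget is used up. It is only an analogy. It does not use Beam and says nothing about how either runner implements withMaxReadTime. The class name TimeBoundedSequence and the helper generate are hypothetical names introduced for illustration.

```java
import java.time.Duration;
import java.util.function.LongConsumer;

public class TimeBoundedSequence {

    // Hypothetical helper, not a Beam API: passes from, from + 1, ... to the
    // sink until maxReadTime has elapsed, then returns how many were emitted.
    static long generate(long from, Duration maxReadTime, LongConsumer sink) {
        long start = System.nanoTime();
        long limitNanos = maxReadTime.toNanos();
        long next = from;
        // Comparing elapsed time (not absolute deadlines) avoids nanoTime overflow issues.
        while (System.nanoTime() - start < limitNanos) {
            sink.accept(next++);
        }
        return next - from;
    }

    public static void main(String[] args) {
        // Mirrors the question's parameters: start at 1, read for 3000 ms.
        long count = generate(1, Duration.ofMillis(3000), System.out::println);
        System.out.println("Stopped after " + count + " elements");
    }
}
```

Running main prints consecutive numbers for about three seconds and then terminates, which is the behaviour the pipeline above was expected to show on both runners.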

If I simply omit the max read time from the code and run only the following:

PCollection<Long> data = pipeline.apply(GenerateSequence.from(1));

//Print generated data
data.apply(ParDo.of(new DoFn<Long, String>() {
  @ProcessElement
  public void processElement(@Element Long input) {
    System.out.println(input);
  }
}));

pipeline.run();

In that case, both DirectRunner and FlinkRunner print the generated values to the console without issues.

Does anyone have an idea why the two runners behave differently here?

Source: https://stackoverflow.com/questions/56112392/discrepancy-in-running-apache-beam-data-generator-with-directrunner-and-flinkrun
