Running Apache Beam pipeline in Spring Boot project on Google Data Flow


Question


I'm trying to run an Apache Beam pipeline in a Spring Boot project on Google Dataflow, but I keep getting this error: Failed to construct instance from factory method DataflowRunner#fromOptions(interface org.apache.beam.sdk.options.PipelineOptions)

The example I'm trying to run is the basic word count provided by the official documentation, https://beam.apache.org/get-started/wordcount-example/ . The problem is that this documentation uses a different class for each example, and each example has its own main function, whereas what I'm trying to do is run the example in a Spring Boot project with a class that implements CommandLineRunner.

Spring Boot main class:

@SpringBootApplication
public class BeamApplication {

    public static void main(String[] args) {
        SpringApplication.run(BeamApplication.class, args);
    }
}

CommandLineRunner:

@Component
public class Runner implements CommandLineRunner {

    @Override
    public void run(String[] args) throws Exception {
        WordCountOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().as(WordCountOptions.class);
        runWordCount(options);
    }

    static void runWordCount(WordCountOptions options) throws InterruptedException {
        Pipeline p = Pipeline.create(options);

        p.apply("ReadLines", TextIO.read().from(options.getInputFile()))
                .apply(new CountWords())
                .apply(MapElements.via(new FormatAsTextFn()))
                .apply("WriteCounts", TextIO.write().to(options.getOutput()));

        p.run().waitUntilFinish();
    }
}

WordCountOptions:

public interface WordCountOptions extends PipelineOptions {

    @Description("Path of the file to read from")
    @Default.String("./src/main/resources/input.txt")
    String getInputFile();
    void setInputFile(String value);

    @Description("Path of the output file")
    // @Validation.Required
    // @Default.String("./target/ts_output/extracted_words")
    @Default.String("Path of the file to write to")
    String getOutput();
    void setOutput(String value);
}

ExtractWordsFn:

public class ExtractWordsFn extends DoFn<String, String> {

    public static final String TOKENIZER_PATTERN = "[^\\p{L}]+";

    @ProcessElement
    public void processElement(ProcessContext c) {
        for (String word : c.element().split(TOKENIZER_PATTERN)) {
            if (!word.isEmpty()) {
                c.output(word);
            }
        }
    }
}

CountWords:

public class CountWords extends PTransform<PCollection<String>, PCollection<KV<String, Long>>> {

    @Override
    public PCollection<KV<String, Long>> expand(PCollection<String> lines) {
        // Convert lines of text into individual words.
        PCollection<String> words = lines.apply(ParDo.of(new ExtractWordsFn()));

        // Count the number of times each word occurs.
        PCollection<KV<String, Long>> wordCounts = words.apply(Count.perElement());

        return wordCounts;
    }
}

When I use the DirectRunner, the project works as expected and generates output files in the root directory of the project. But when I try to use the Dataflow runner by passing these arguments, --runner=DataflowRunner --project=datalake-ng --stagingLocation=gs://data_transformer/staging/ --output=gs://data_transformer/output (whether with java -jar or from IntelliJ), I get the error mentioned at the beginning of my post.
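
In case the arguments somehow don't survive the Spring Boot startup path, one thing I could try is building the Dataflow options programmatically instead of relying on fromArgs (just a sketch, reusing the same project and bucket names as above; it assumes org.apache.beam.runners.dataflow.DataflowRunner and org.apache.beam.runners.dataflow.options.DataflowPipelineOptions are on the classpath):

@Override
public void run(String[] args) throws Exception {
    // Sketch: configure the Dataflow runner in code instead of via --runner/--project arguments.
    DataflowPipelineOptions dataflowOptions = PipelineOptionsFactory.as(DataflowPipelineOptions.class);
    dataflowOptions.setRunner(DataflowRunner.class);
    dataflowOptions.setProject("datalake-ng");                            // same project as in the args above
    dataflowOptions.setStagingLocation("gs://data_transformer/staging/"); // same staging bucket as above
    WordCountOptions options = dataflowOptions.as(WordCountOptions.class);
    options.setOutput("gs://data_transformer/output");                    // same output bucket as above
    runWordCount(options);
}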

I'm using Java 11, and after looking at Failed to construct instance from factory method DataflowRunner#fromOptions in beamSql, apache beam, I tried moving my code into a fresh Java 8 Spring Boot project, but the error remained the same.

When running the project provided by the Apache Beam documentation (separate classes, each with its own main), it works fine on Google Dataflow and I can see the generated output in the Google bucket. My WordCountOptions interface is also the same as the one provided by the official documentation.

Could the issue be caused by the CommandLineRunner? I thought that the arguments were not being received by the app, but when I debugged this line,

WordCountOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().as(WordCountOptions.class); 

the variable options has the right values, which are --runner=DataflowRunner --project=target-datalake-ng --stagingLocation=gs://data_transformer/staging/ --output=gs://data_transformer/output.
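
A debug snippet like the following (not part of my original code) is one way to print what the factory resolved:

// Debug check: print what PipelineOptionsFactory actually parsed from the arguments.
System.out.println("Runner: " + options.getRunner());     // expecting class org.apache.beam.runners.dataflow.DataflowRunner
System.out.println("Output: " + options.getOutput());     // expecting gs://data_transformer/output
System.out.println("Input:  " + options.getInputFile());  // expecting the default ./src/main/resources/input.txt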

EDIT:

I found out that the cause of the error is a problem with gcloud authentication and access to the Google Cloud bucket (Anonymous caller does not have storage.buckets.list access to project 961543751). I double-checked the access, and it should be set correctly since it works fine with the Beam example's default project. I revoked all access and set it up again, but the issue remains. I took a look at https://github.com/googleapis/google-cloud-node/issues/2456 and https://github.com/googleapis/google-cloud-ruby/issues/1588 , and I'm still trying to identify the issue, but for now it seems like a dependency version problem.
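
If it really is an authentication problem rather than a dependency one, one thing I could still try is attaching the credentials to the pipeline options explicitly instead of relying on the ambient gcloud login (just a sketch; the key file path is a placeholder and it assumes google-auth-library is on the classpath):

import java.io.FileInputStream;
import java.util.Collections;

import com.google.auth.oauth2.GoogleCredentials;
import org.apache.beam.sdk.extensions.gcp.options.GcpOptions;

// Sketch: load a service-account key explicitly and attach it to the pipeline options,
// so the pipeline does not depend on GOOGLE_APPLICATION_CREDENTIALS being picked up.
GoogleCredentials credentials = GoogleCredentials
        .fromStream(new FileInputStream("/path/to/service-account-key.json")) // placeholder path
        .createScoped(Collections.singletonList("https://www.googleapis.com/auth/cloud-platform"));
options.as(GcpOptions.class).setGcpCredential(credentials);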

Source: https://stackoverflow.com/questions/57356506/running-apache-beam-pipeline-in-spring-boot-project-on-google-data-flow
