Error while staging packages when a Dataflow job is launched from a fat jar

Submitted by 眉间皱痕 on 2019-12-08 04:30:36

Question


I created a Maven project to execute a pipeline. If I run the main class directly, the pipeline works perfectly. If I build a fat jar and execute that instead, I get two different errors: one when I run it under Windows and another when I run it under Linux.

Under Windows:

Exception in thread "main" java.lang.RuntimeException: Error while staging packages
    at org.apache.beam.runners.dataflow.util.PackageUtil.stageClasspathElements(PackageUtil.java:364)
    at org.apache.beam.runners.dataflow.util.PackageUtil.stageClasspathElements(PackageUtil.java:261)
    at org.apache.beam.runners.dataflow.util.GcsStager.stageFiles(GcsStager.java:66)
    at org.apache.beam.runners.dataflow.DataflowRunner.run(DataflowRunner.java:517)
    at org.apache.beam.runners.dataflow.DataflowRunner.run(DataflowRunner.java:170)
    at org.apache.beam.sdk.Pipeline.run(Pipeline.java:303)
    at org.apache.beam.sdk.Pipeline.run(Pipeline.java:289)
    at ....
Caused by: java.nio.file.InvalidPathException: Illegal char <:> at index 2: gs://MY_BUCKET/staging
    at sun.nio.fs.WindowsPathParser.normalize(Unknown Source)
    at sun.nio.fs.WindowsPathParser.parse(Unknown Source)
    at sun.nio.fs.WindowsPathParser.parse(Unknown Source)
    at sun.nio.fs.WindowsPath.parse(Unknown Source)
    at sun.nio.fs.WindowsFileSystem.getPath(Unknown Source)
    at java.nio.file.Paths.get(Unknown Source)
    at org.apache.beam.sdk.io.LocalFileSystem.matchNewResource(LocalFileSystem.java:196)
    at org.apache.beam.sdk.io.LocalFileSystem.matchNewResource(LocalFileSystem.java:78)
    at org.apache.beam.sdk.io.FileSystems.matchNewResource(FileSystems.java:563)
    at org.apache.beam.runners.dataflow.util.PackageUtil$PackageAttributes.forFileToStage(PackageUtil.java:452)
    at org.apache.beam.runners.dataflow.util.PackageUtil$1.call(PackageUtil.java:147)
    at org.apache.beam.runners.dataflow.util.PackageUtil$1.call(PackageUtil.java:138)
    at org.apache.beam.runners.dataflow.repackaged.com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:111)
    at org.apache.beam.runners.dataflow.repackaged.com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:58)
    at org.apache.beam.runners.dataflow.repackaged.com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:75)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)

Under Linux:

Exception in thread "main" java.lang.RuntimeException: Failed to construct instance from factory method DataflowRunner#fromOptions(interface org.apache.beam.sdk.options.PipelineOptions)
        at org.apache.beam.sdk.util.InstanceBuilder.buildFromMethod(InstanceBuilder.java:233)
        at org.apache.beam.sdk.util.InstanceBuilder.build(InstanceBuilder.java:162)
        at org.apache.beam.sdk.PipelineRunner.fromOptions(PipelineRunner.java:52)
        at org.apache.beam.sdk.Pipeline.create(Pipeline.java:142)
        at ....
Caused by: java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.beam.sdk.util.InstanceBuilder.buildFromMethod(InstanceBuilder.java:222)
        ... 8 more
Caused by: java.lang.IllegalArgumentException: Expected a valid 'gs://' path but was given '/home/USER/gs:/MY_BUCKET/temp/staging/'
        at org.apache.beam.sdk.extensions.gcp.storage.GcsPathValidator.getGcsPath(GcsPathValidator.java:101)
        at org.apache.beam.sdk.extensions.gcp.storage.GcsPathValidator.verifyPath(GcsPathValidator.java:75)
        at org.apache.beam.sdk.extensions.gcp.storage.GcsPathValidator.validateOutputFilePrefixSupported(GcsPathValidator.java:60)
        at org.apache.beam.runners.dataflow.DataflowRunner.fromOptions(DataflowRunner.java:237)
        ... 13 more
Caused by: java.lang.IllegalArgumentException: Invalid GCS URI: /home/USER/gs:/MY_BUCKET/temp/staging/
        at org.apache.beam.sdks.java.extensions.google.cloud.platform.core.repackaged.com.google.common.base.Preconditions.checkArgument(Preconditions.java:191)
        at org.apache.beam.sdk.util.gcsfs.GcsPath.fromUri(GcsPath.java:116)
        at org.apache.beam.sdk.extensions.gcp.storage.GcsPathValidator.getGcsPath(GcsPathValidator.java:99)
        ... 16 more

This is my pom.xml:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>xxxxxxxxxxx</groupId>
  <artifactId>xxxxxxxxx</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <dependencies>
        <!-- https://mvnrepository.com/artifact/com.google.cloud.dataflow/google-cloud-dataflow-java-sdk-all -->
        <dependency>
            <groupId>com.google.cloud.dataflow</groupId>
            <artifactId>google-cloud-dataflow-java-sdk-all</artifactId>
            <version>2.2.0</version>
        </dependency>
        <dependency>
          <groupId>com.fasterxml.jackson.core</groupId>
          <artifactId>jackson-core</artifactId>
          <version>2.9.3</version>
        </dependency>

        <dependency>
          <groupId>com.fasterxml.jackson.core</groupId>
          <artifactId>jackson-databind</artifactId>
          <version>2.9.3</version>
        </dependency>
        <dependency>
            <groupId>com.google.appengine</groupId>
            <artifactId>appengine-api-1.0-sdk</artifactId>
            <version>1.9.60</version>
        </dependency>
        <dependency>
            <groupId>com.google.cloud</groupId>
            <artifactId>google-cloud-datastore</artifactId>
            <version>1.15.0</version>
        </dependency>
        <dependency>
            <groupId>javax.servlet</groupId>
            <artifactId>javax.servlet-api</artifactId>
            <version>4.0.0</version>
        </dependency>

    </dependencies>
    <build>
        <finalName>myFatJar</finalName>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>2.3.2</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>

            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>3.0.0</version>
                <configuration>
                    <transformers>
                        <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                            <mainClass>com.myclass.MyClass</mainClass>
                        </transformer>
                    </transformers>
                </configuration>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>

        </plugins>
    </build>
</project>

And these are my pipeline options:

    ...
    DataflowPipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create().as(DataflowPipelineOptions.class);
    //options.setGcpTempLocation("gs://MY_BUCKET/temp");
    options.setTempLocation("gs://MY_BUCKET/temp");
    options.setStagingLocation("gs://MY_BUCKET/staging");
    options.setProject("xxxxxxxx");
    options.setJobName("asd");
    options.setRunner(DataflowRunner.class);
    Pipeline.create(options);
    ...

I tried swapping tempLocation for GcpTempLocation, but if I do, I get this error:

java.lang.IllegalArgumentException: BigQueryIO.Write needs a GCS temp location to store temp files.
        at com.google.common.base.Preconditions.checkArgument(Preconditions.java:122)
        at org.apache.beam.sdk.io.gcp.bigquery.BatchLoads.validate(BatchLoads.java:191)
        at org.apache.beam.sdk.Pipeline$ValidateVisitor.enterCompositeTransform(Pipeline.java:621)
        at org.apache.beam.sdk.runners.TransformHierarchy$Node.visit(TransformHierarchy.java:651)
        at org.apache.beam.sdk.runners.TransformHierarchy$Node.visit(TransformHierarchy.java:655)
        at org.apache.beam.sdk.runners.TransformHierarchy$Node.visit(TransformHierarchy.java:655)
        at org.apache.beam.sdk.runners.TransformHierarchy$Node.access$600(TransformHierarchy.java:311)
        at org.apache.beam.sdk.runners.TransformHierarchy.visit(TransformHierarchy.java:245)
        at org.apache.beam.sdk.Pipeline.traverseTopologically(Pipeline.java:446)
        at org.apache.beam.sdk.Pipeline.validate(Pipeline.java:563)
        at org.apache.beam.sdk.Pipeline.run(Pipeline.java:302)
        at org.apache.beam.sdk.Pipeline.run(Pipeline.java:289)
        at ...
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.codehaus.mojo.exec.ExecJavaMojo$1.run(ExecJavaMojo.java:282)
        at java.lang.Thread.run(Thread.java:748)

What should I do?


Answer 1:


This comment resolved my question:

Did you try explicitly adding the Apache Beam artifact for DataflowRunner to pom.xml? – Andrew
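
For reference, a minimal sketch of what that addition might look like in pom.xml. The artifact shown is the Apache Beam Dataflow runner; the version is an assumption and should be aligned with the Beam/Dataflow SDK version in use:

    <!-- Hypothetical addition: declare the Dataflow runner explicitly so it
         (and its service registrations) end up in the shaded jar.
         The version here is an assumption; match it to your SDK version. -->
    <dependency>
        <groupId>org.apache.beam</groupId>
        <artifactId>beam-runners-google-cloud-dataflow-java</artifactId>
        <version>2.2.0</version>
    </dependency>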




Answer 2:


Adding a second answer here that should solve this issue more broadly. I have a hunch that S.M.'s approach of pulling the dependencies out to the top level of the pom file coincidentally got around the real issue: not using the shade ServicesResourceTransformer in conjunction with the ManifestResourceTransformer. Without the ServicesResourceTransformer, the shade plugin does not merge META-INF/services files when building the fat jar, so service-loaded registrations (such as Beam's FileSystemRegistrar entries, which make gs:// paths resolvable) can be lost, which would explain why gs:// paths are being handled by the local filesystem in the stack traces above.

However, without seeing S.M.'s final pom file, I can't be certain.

Anyhow, I have included the shade build plugin configuration that worked for me:

<build>
    <pluginManagement>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>3.0.0</version>
                <executions>
                    <execution>
                        <id>generate-runner</id>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <finalName>${project.artifactId}${runner.suffix}</finalName>
                            <filters>
                                <filter>
                                    <artifact>*:*</artifact>
                                    <excludes>
                                        <exclude>META-INF/LICENSE</exclude>
                                        <exclude>META-INF/*.SF</exclude>
                                        <exclude>META-INF/*.DSA</exclude>
                                        <exclude>META-INF/*.RSA</exclude>
                                    </excludes>
                                </filter>
                            </filters>
                            <transformers>
                                <transformer
                                    implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer" />
                                <transformer
                                    implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                                    <mainClass>${runner.class}</mainClass>
                                </transformer>
                            </transformers>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </pluginManagement>
</build>
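
One caveat: if this block lives under <pluginManagement> as shown, the plugin also needs to be referenced under <build><plugins> for the shade execution to actually run; a minimal sketch of that reference, assuming the managed configuration above:

    <!-- Hypothetical: references the managed maven-shade-plugin configuration above. -->
    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
            </plugin>
        </plugins>
    </build>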

Notes:

  • I used this in conjunction with:

    <dependency>
        <groupId>com.google.cloud.dataflow</groupId>
        <artifactId>google-cloud-dataflow-java-sdk-all</artifactId>
        <version>2.5.0</version>
    </dependency>
    
  • I picked up the exclusion arguments from: https://github.com/GoogleCloudPlatform/DataflowSDK-examples/blob/master/java/examples-java8/pom.xml

  • This worked for me on both Windows and Linux
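
  • The ${runner.suffix} and ${runner.class} properties are assumed to be defined elsewhere in the pom (for example via per-runner profiles); a minimal sketch with hypothetical placeholder values:

    <!-- Hypothetical values; substitute your own main class and jar suffix. -->
    <properties>
        <runner.suffix>-dataflow</runner.suffix>
        <runner.class>com.myclass.MyClass</runner.class>
    </properties>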



Source: https://stackoverflow.com/questions/48689089/error-while-staging-packages-when-a-dataflow-job-is-launched-from-a-fat-jar
