Question
I created a Maven project to run a Dataflow pipeline. If I run the main class directly, the pipeline works perfectly. If I build a fat jar and execute that instead, I get two different errors: one under Windows and another one under Linux.
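For context, the jar is built and launched roughly like this (assuming the finalName from the pom.xml below; the program arguments are omitted here):
mvn clean package
java -jar target/myFatJar.jar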
Under Windows:
Exception in thread "main" java.lang.RuntimeException: Error while staging packages
at org.apache.beam.runners.dataflow.util.PackageUtil.stageClasspathElements(PackageUtil.java:364)
at org.apache.beam.runners.dataflow.util.PackageUtil.stageClasspathElements(PackageUtil.java:261)
at org.apache.beam.runners.dataflow.util.GcsStager.stageFiles(GcsStager.java:66)
at org.apache.beam.runners.dataflow.DataflowRunner.run(DataflowRunner.java:517)
at org.apache.beam.runners.dataflow.DataflowRunner.run(DataflowRunner.java:170)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:303)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:289)
at ....
Caused by: java.nio.file.InvalidPathException: Illegal char <:> at index 2: gs://MY_BUCKET/staging
at sun.nio.fs.WindowsPathParser.normalize(Unknown Source)
at sun.nio.fs.WindowsPathParser.parse(Unknown Source)
at sun.nio.fs.WindowsPathParser.parse(Unknown Source)
at sun.nio.fs.WindowsPath.parse(Unknown Source)
at sun.nio.fs.WindowsFileSystem.getPath(Unknown Source)
at java.nio.file.Paths.get(Unknown Source)
at org.apache.beam.sdk.io.LocalFileSystem.matchNewResource(LocalFileSystem.java:196)
at org.apache.beam.sdk.io.LocalFileSystem.matchNewResource(LocalFileSystem.java:78)
at org.apache.beam.sdk.io.FileSystems.matchNewResource(FileSystems.java:563)
at org.apache.beam.runners.dataflow.util.PackageUtil$PackageAttributes.forFileToStage(PackageUtil.java:452)
at org.apache.beam.runners.dataflow.util.PackageUtil$1.call(PackageUtil.java:147)
at org.apache.beam.runners.dataflow.util.PackageUtil$1.call(PackageUtil.java:138)
at org.apache.beam.runners.dataflow.repackaged.com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:111)
at org.apache.beam.runners.dataflow.repackaged.com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:58)
at org.apache.beam.runners.dataflow.repackaged.com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:75)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Under Linux:
Exception in thread "main" java.lang.RuntimeException: Failed to construct instance from factory method DataflowRunner#fromOptions(interface org.apache.beam.sdk.options.PipelineOptions)
at org.apache.beam.sdk.util.InstanceBuilder.buildFromMethod(InstanceBuilder.java:233)
at org.apache.beam.sdk.util.InstanceBuilder.build(InstanceBuilder.java:162)
at org.apache.beam.sdk.PipelineRunner.fromOptions(PipelineRunner.java:52)
at org.apache.beam.sdk.Pipeline.create(Pipeline.java:142)
at ....
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.beam.sdk.util.InstanceBuilder.buildFromMethod(InstanceBuilder.java:222)
... 8 more
Caused by: java.lang.IllegalArgumentException: Expected a valid 'gs://' path but was given '/home/USER/gs:/MY_BUCKET/temp/staging/'
at org.apache.beam.sdk.extensions.gcp.storage.GcsPathValidator.getGcsPath(GcsPathValidator.java:101)
at org.apache.beam.sdk.extensions.gcp.storage.GcsPathValidator.verifyPath(GcsPathValidator.java:75)
at org.apache.beam.sdk.extensions.gcp.storage.GcsPathValidator.validateOutputFilePrefixSupported(GcsPathValidator.java:60)
at org.apache.beam.runners.dataflow.DataflowRunner.fromOptions(DataflowRunner.java:237)
... 13 more
Caused by: java.lang.IllegalArgumentException: Invalid GCS URI: /home/USER/gs:/MY_BUCKET/temp/staging/
at org.apache.beam.sdks.java.extensions.google.cloud.platform.core.repackaged.com.google.common.base.Preconditions.checkArgument(Preconditions.java:191)
at org.apache.beam.sdk.util.gcsfs.GcsPath.fromUri(GcsPath.java:116)
at org.apache.beam.sdk.extensions.gcp.storage.GcsPathValidator.getGcsPath(GcsPathValidator.java:99)
... 16 more
This is my pom.xml:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>xxxxxxxxxxx</groupId>
  <artifactId>xxxxxxxxx</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <dependencies>
    <!-- https://mvnrepository.com/artifact/com.google.cloud.dataflow/google-cloud-dataflow-java-sdk-all -->
    <dependency>
      <groupId>com.google.cloud.dataflow</groupId>
      <artifactId>google-cloud-dataflow-java-sdk-all</artifactId>
      <version>2.2.0</version>
    </dependency>
    <dependency>
      <groupId>com.fasterxml.jackson.core</groupId>
      <artifactId>jackson-core</artifactId>
      <version>2.9.3</version>
    </dependency>
    <dependency>
      <groupId>com.fasterxml.jackson.core</groupId>
      <artifactId>jackson-databind</artifactId>
      <version>2.9.3</version>
    </dependency>
    <dependency>
      <groupId>com.google.appengine</groupId>
      <artifactId>appengine-api-1.0-sdk</artifactId>
      <version>1.9.60</version>
    </dependency>
    <dependency>
      <groupId>com.google.cloud</groupId>
      <artifactId>google-cloud-datastore</artifactId>
      <version>1.15.0</version>
    </dependency>
    <dependency>
      <groupId>javax.servlet</groupId>
      <artifactId>javax.servlet-api</artifactId>
      <version>4.0.0</version>
    </dependency>
  </dependencies>
  <build>
    <finalName>myFatJar</finalName>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>2.3.2</version>
        <configuration>
          <source>1.8</source>
          <target>1.8</target>
        </configuration>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
        <version>3.0.0</version>
        <configuration>
          <transformers>
            <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
              <mainClass>com.myclass.MyClass</mainClass>
            </transformer>
          </transformers>
        </configuration>
        <executions>
          <execution>
            <phase>package</phase>
            <goals>
              <goal>shade</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</project>
And these are my pipeline options:
...
DataflowPipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create().as(DataflowPipelineOptions.class);
//options.setGcpTempLocation("gs://MY_BUCKET/temp");
options.setTempLocation("gs://MY_BUCKET/temp");
options.setStagingLocation("gs://MY_BUCKET/staging");
options.setProject("xxxxxxxx");
options.setJobName("asd");
options.setRunner(DataflowRunner.class);
Pipeline.create(options);
...
I tried replacing setTempLocation with setGcpTempLocation but, if I do, I get this error:
java.lang.IllegalArgumentException: BigQueryIO.Write needs a GCS temp location to store temp files.
at com.google.common.base.Preconditions.checkArgument(Preconditions.java:122)
at org.apache.beam.sdk.io.gcp.bigquery.BatchLoads.validate(BatchLoads.java:191)
at org.apache.beam.sdk.Pipeline$ValidateVisitor.enterCompositeTransform(Pipeline.java:621)
at org.apache.beam.sdk.runners.TransformHierarchy$Node.visit(TransformHierarchy.java:651)
at org.apache.beam.sdk.runners.TransformHierarchy$Node.visit(TransformHierarchy.java:655)
at org.apache.beam.sdk.runners.TransformHierarchy$Node.visit(TransformHierarchy.java:655)
at org.apache.beam.sdk.runners.TransformHierarchy$Node.access$600(TransformHierarchy.java:311)
at org.apache.beam.sdk.runners.TransformHierarchy.visit(TransformHierarchy.java:245)
at org.apache.beam.sdk.Pipeline.traverseTopologically(Pipeline.java:446)
at org.apache.beam.sdk.Pipeline.validate(Pipeline.java:563)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:302)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:289)
at ...
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.codehaus.mojo.exec.ExecJavaMojo$1.run(ExecJavaMojo.java:282)
at java.lang.Thread.run(Thread.java:748)
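From that stack trace it looks like BigQueryIO takes its temp location from tempLocation rather than gcpTempLocation, so presumably both options can be set side by side. A sketch of what I mean (same bucket assumed, untested beyond what is shown above):

// read by BigQueryIO when staging its load files
options.setTempLocation("gs://MY_BUCKET/temp");
// read by the Dataflow runner for its own temp files
options.setGcpTempLocation("gs://MY_BUCKET/temp");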
What should I do?
Answer 1:
This comment resolved my question:
Did you try explicitly adding the Apache Beam artifact for DataflowRunner to pom.xml? – Andrew
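For anyone else landing here, that means adding something like the following to the pom.xml (a sketch; the version should match the Dataflow SDK version in use, here assumed to be 2.2.0 as in the question):

<dependency>
  <groupId>org.apache.beam</groupId>
  <artifactId>beam-runners-google-cloud-dataflow-java</artifactId>
  <version>2.2.0</version>
</dependency>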
Answer 2:
Adding a second answer here that should solve this issue more broadly.
I have a hunch that S.M.'s approach of pulling the dependencies out to the top level of the pom file coincidentally got around the real issue: not using the shade ServicesResourceTransformer in conjunction with the ManifestResourceTransformer. Without the ServicesResourceTransformer, shading lets the META-INF/services registration files from different jars overwrite one another, so Beam's FileSystem registrars get lost and gs:// paths fall back to the local filesystem, which is consistent with both stack traces above. However, without seeing the final pom file from S.M., I can't be certain.
Anyhow, here is the shade build plugin configuration that worked for me:
<build>
  <pluginManagement>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
        <version>3.0.0</version>
        <executions>
          <execution>
            <id>generate-runner</id>
            <goals>
              <goal>shade</goal>
            </goals>
            <configuration>
              <finalName>${project.artifactId}${runner.suffix}</finalName>
              <filters>
                <filter>
                  <artifact>*:*</artifact>
                  <excludes>
                    <exclude>META-INF/LICENSE</exclude>
                    <exclude>META-INF/*.SF</exclude>
                    <exclude>META-INF/*.DSA</exclude>
                    <exclude>META-INF/*.RSA</exclude>
                  </excludes>
                </filter>
              </filters>
              <transformers>
                <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer" />
                <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                  <mainClass>${runner.class}</mainClass>
                </transformer>
              </transformers>
            </configuration>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </pluginManagement>
</build>
Notes:
I used this in conjunction with:
<dependency>
  <groupId>com.google.cloud.dataflow</groupId>
  <artifactId>google-cloud-dataflow-java-sdk-all</artifactId>
  <version>2.5.0</version>
</dependency>
I picked up the exclusion arguments from: https://github.com/GoogleCloudPlatform/DataflowSDK-examples/blob/master/java/examples-java8/pom.xml
It worked for me on both Windows and Linux.
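One caveat: the ${runner.suffix} and ${runner.class} properties referenced in the configuration above are not defined in the snippet itself. A minimal sketch of how they might be declared in the pom (the names are from the snippet; the values here are assumptions matching the question's main class):

<properties>
  <runner.class>com.myclass.MyClass</runner.class>
  <runner.suffix>-dataflow</runner.suffix>
</properties>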
Source: https://stackoverflow.com/questions/48689089/error-while-staging-packages-when-a-dataflow-job-is-launched-from-a-fat-jar