NoClassDefFoundError: org/apache/hadoop/fs/StreamCapabilities while reading s3 Data with spark


Question:

I would like to run a simple spark job on my local dev machine (through Intellij) reading data from Amazon s3.

my build.sbt file:

scalaVersion := "2.11.12"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.3.1",
  "org.apache.spark" %% "spark-sql" % "2.3.1",
  "com.amazonaws" % "aws-java-sdk" % "1.11.407",
  "org.apache.hadoop" % "hadoop-aws" % "3.1.1"
)

my code snippet:

val spark = SparkSession
  .builder
  .appName("test")
  .master("local[2]")
  .getOrCreate()

spark
  .sparkContext
  .hadoopConfiguration
  .set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")

val schema_p = ...

val df = spark
  .read
  .schema(schema_p)
  .parquet("s3a:///...")

And I get the following exception:

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/fs/StreamCapabilities
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
    at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
    at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:348)
    at org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2093)
    at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2058)
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2152)
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2580)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2593)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2632)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2614)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
    at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:45)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:354)
    at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
    at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:622)
    at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:606)
    at Test$.delayedEndpoint$Test$1(Test.scala:27)
    at Test$delayedInit$body.apply(Test.scala:4)
    at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
    at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
    at scala.App$$anonfun$main$1.apply(App.scala:76)
    at scala.App$$anonfun$main$1.apply(App.scala:76)
    at scala.collection.immutable.List.foreach(List.scala:392)
    at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
    at scala.App$class.main(App.scala:76)
    at Test$.main(Test.scala:4)
    at Test.main(Test.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.fs.StreamCapabilities
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    ... 41 more

When I replace s3a:/// with s3:/// I get a different error: No FileSystem for scheme: s3

As I am new to AWS, I do not know whether I should use s3:///, s3a:/// or s3n:///. I have already set up my AWS credentials with aws-cli.

I do not have any Spark installation on my machine.

Thanks in advance for your help

Answer 1:

I would start by looking at the S3A troubleshooting docs

Do not attempt to “drop in” a newer version of the AWS SDK than the one the Hadoop version was built with. Whatever problem you have, changing the AWS SDK version will not fix it; it will only change the stack traces you see.

Whatever version of the hadoop-* JARs your local Spark installation uses, you need exactly the same version of hadoop-aws, and exactly the same version of the AWS SDK that hadoop-aws was built with. Try mvnrepository for the details.
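As a concrete illustration, here is a minimal build.sbt sketch, assuming Spark 2.3.1 built against Hadoop 2.7.x (the version numbers below are an assumption; check what your Spark distribution actually bundles). Note that no explicit aws-java-sdk dependency is declared, so hadoop-aws can pull in the SDK version it was built with:

scalaVersion := "2.11.12"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.3.1",
  "org.apache.spark" %% "spark-sql" % "2.3.1",
  // hadoop-aws must match the hadoop-* version Spark was built with;
  // it transitively brings in the matching aws-java-sdk (1.7.4 for the 2.7.x line)
  "org.apache.hadoop" % "hadoop-aws" % "2.7.7"
)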



Answer 2:

For me, it was solved by adding the following dependency in pom.xml, in addition to the ones above:

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>3.1.1</version>
</dependency>
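If you are building with sbt as in the question, a rough equivalent (assuming you keep the same 3.1.1 version as the other Hadoop artifacts) would be:

// hadoop-common is the artifact that contains org.apache.hadoop.fs.StreamCapabilities,
// the class missing in the original stack trace; keep its version aligned with hadoop-aws
libraryDependencies += "org.apache.hadoop" % "hadoop-common" % "3.1.1"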

