How to use S3 with Apache Spark 2.2 in the Spark shell


Question:

I'm trying to load data from an Amazon AWS S3 bucket, while in the Spark shell.

I have consulted the following resources:

Parsing files from Amazon S3 with Apache Spark

How to access s3a:// files from Apache Spark?

Hortonworks Spark 1.6 and S3

Cloudera

Custom s3 endpoints

I have downloaded and unzipped Apache Spark 2.2.0. In conf/spark-defaults.conf I have the following (note: access-key and secret-key are placeholders for my actual credentials):

spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.access.key=access-key
spark.hadoop.fs.s3a.secret.key=secret-key
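(For completeness, here is a minimal sketch of setting the same s3a options at runtime from inside the shell instead of in spark-defaults.conf; it uses the standard Hadoop configuration keys, with the same placeholder credentials:)

// Equivalent runtime configuration from inside the Spark shell.
// "access-key" and "secret-key" are placeholders, as above.
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoopConf.set("fs.s3a.access.key", "access-key")
hadoopConf.set("fs.s3a.secret.key", "secret-key")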

I have downloaded hadoop-aws-2.8.1.jar and aws-java-sdk-1.11.179.jar from mvnrepository, and placed them in the jars/ directory. I then start the Spark shell:

bin/spark-shell --jars jars/hadoop-aws-2.8.1.jar,jars/aws-java-sdk-1.11.179.jar 

In the shell, here is how I try to load data from the S3 bucket:

val p = spark.read.textFile("s3a://sparkcookbook/person") 

And here is the error that results:

java.lang.NoClassDefFoundError: org/apache/hadoop/fs/GlobalStorageStatistics$StorageStatisticsProvider
  at java.lang.Class.forName0(Native Method)
  at java.lang.Class.forName(Class.java:348)
  at org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2134)
  at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2099)
  at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
  at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
  at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
  at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
  at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
  at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
  at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)

When I instead try to start the Spark shell as follows:

bin/spark-shell --packages org.apache.hadoop:hadoop-aws:2.8.1 

Then I get two errors: one when the interpreter starts, and another when I try to load the data. Here is the first:

:: problems summary ::
:::: ERRORS
    unknown resolver null
    unknown resolver null
    unknown resolver null
    unknown resolver null
    unknown resolver null
    unknown resolver null
:: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS

And here is the second:

val p = spark.read.textFile("s3a://sparkcookbook/person")
java.lang.IllegalAccessError: tried to access method org.apache.hadoop.metrics2.lib.MutableCounterLong.<init>(Lorg/apache/hadoop/metrics2/MetricsInfo;J)V from class org.apache.hadoop.fs.s3a.S3AInstrumentation
  at org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:195)
  at org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:216)
  at org.apache.hadoop.fs.s3a.S3AInstrumentation.<init>(S3AInstrumentation.java:139)
  at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:174)
  at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
  at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
  at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
  at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
  at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
  at org.apache.spark.sql.execution.datasources.DataSource.hasMetadata(DataSource.scala:301)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:344)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
  at org.apache.spark.sql.DataFrameReader.text(DataFrameReader.scala:506)
  at org.apache.spark.sql.DataFrameReader.textFile(DataFrameReader.scala:542)
  at org.apache.spark.sql.DataFrameReader.textFile(DataFrameReader.scala:515)

Could someone suggest how to get this working? Thanks.

Answer 1:

If you are using Apache Spark 2.2.0, then you should use hadoop-aws-2.7.3.jar and aws-java-sdk-1.7.4.jar. The prebuilt Spark 2.2.0 distribution bundles Hadoop 2.7.x, and the hadoop-aws jar must match the bundled Hadoop version exactly; mixing hadoop-aws 2.8.1 with Hadoop 2.7 classes is what produces the NoClassDefFoundError and IllegalAccessError above. aws-java-sdk-1.7.4 is the SDK version that hadoop-aws 2.7.x was built against.

$ spark-shell --jars jars/hadoop-aws-2.7.3.jar,jars/aws-java-sdk-1.7.4.jar 
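Alternatively (a sketch I have not verified here), the matching version can be pulled with --packages, which should also resolve the compatible AWS SDK transitively:

bin/spark-shell --packages org.apache.hadoop:hadoop-aws:2.7.3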

After that, loading data from the S3 bucket in the shell will work.
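For example, a quick check that the read succeeds, using the same path as in the question:

// Read the bucket from the question and print a few rows
// to confirm the s3a filesystem is wired up correctly.
val p = spark.read.textFile("s3a://sparkcookbook/person")
p.show(5)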


