I am passing input and output folders as parameters to mapreduce word count program from webpage.
Getting below error:
HTTP Status 500 - Request processing failed; nested exception is java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively).
The documentation has the format: http://wiki.apache.org/hadoop/AmazonS3
s3n://ID:SECRET@BUCKET/Path
I suggest you use this:
hadoop distcp \
-Dfs.s3n.awsAccessKeyId=<your_access_id> \
-Dfs.s3n.awsSecretAccessKey=<your_access_key> \
s3n://origin hdfs://destinations
It also works as a workaround for the occurrence of slashes in the key. The parameters with the id and access key must be supplied exactly in this order: after disctcp and before origin
Passing in the AWS Credentials as part of the Amazon s3n url is not normally recommended, security wise. Especially if that code is pushed to a repository holding service (like github). Ideally set your credentials in the conf/core-site.xml as:
<configuration>
<property>
<name>fs.s3n.awsAccessKeyId</name>
<value>XXXXXX</value>
</property>
<property>
<name>fs.s3n.awsSecretAccessKey</name>
<value>XXXXXX</value>
</property>
</configuration>
or reinstall awscli on your machine.
pip install awscli
For pyspark beginner:
Prepare
Download jar from https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws
, put this to spark jars folder
Then you can
1. Hadoop config file
core-site.xml
export AWS_ACCESS_KEY_ID=<access-key>
export AWS_SECRET_ACCESS_KEY=<secret-key>
<configuration>
<property>
<name>fs.s3n.impl</name>
<value>org.apache.hadoop.fs.s3native.NativeS3FileSystem</value>
</property>
<property>
<name>fs.s3a.impl</name>
<value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
</property>
<property>
<name>fs.s3.impl</name>
<value>org.apache.hadoop.fs.s3.S3FileSystem</value>
</property>
</configuration>
2. pyspark config
sc._jsc.hadoopConfiguration().set("fs.s3.awsAccessKeyId", access_key)
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", access_key)
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", access_key)
sc._jsc.hadoopConfiguration().set("fs.s3.awsSecretAccessKey", secret_key)
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", secret_key)
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", secret_key)
sc._jsc.hadoopConfiguration().set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
sc._jsc.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sc._jsc.hadoopConfiguration().set("fs.s3.impl", "org.apache.hadoop.fs.s3.S3FileSystem")
Example
import sys
from random import random
from operator import add
from pyspark.sql import SparkSession
from pyspark.conf import SparkConf
if __name__ == "__main__":
"""
Usage: S3 sample
"""
access_key = '<access-key>'
secret_key = '<secret-key>'
spark = SparkSession\
.builder\
.appName("Demo")\
.getOrCreate()
sc = spark.sparkContext
# remove this block if use core-site.xml and env variable
sc._jsc.hadoopConfiguration().set("fs.s3.awsAccessKeyId", access_key)
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", access_key)
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", access_key)
sc._jsc.hadoopConfiguration().set("fs.s3.awsSecretAccessKey", secret_key)
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", secret_key)
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", secret_key)
sc._jsc.hadoopConfiguration().set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
sc._jsc.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sc._jsc.hadoopConfiguration().set("fs.s3.impl", "org.apache.hadoop.fs.s3.S3FileSystem")
# fetch from s3, returns RDD
csv_rdd = spark.sparkContext.textFile("s3n://<bucket-name>/path/to/file.csv")
c = csv_rdd.count()
print("~~~~~~~~~~~~~~~~~~~~~count~~~~~~~~~~~~~~~~~~~~~")
print(c)
spark.stop()
来源:https://stackoverflow.com/questions/24924808/how-to-specify-aws-access-key-id-and-secret-access-key-as-part-of-a-amazon-s3n-u