Connect to S3 data from PySpark

花落未央  2020-12-06 06:13

I am trying to read a JSON file from Amazon S3, create a Spark context, and use it to process the data.

Spark is running inside a Docker container. So putting the file …

2 Answers
  •  無奈伤痛
    2020-12-06 06:35

    I solved this by adding --packages org.apache.hadoop:hadoop-aws:2.7.1 to the spark-submit command.

    It will download the missing Hadoop packages that allow you to run Spark jobs against S3.
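
    If you launch your script with plain python rather than spark-submit, a rough equivalent (a sketch of mine, not part of the original answer) is to set spark.jars.packages programmatically before the SparkContext is created; with recent PySpark versions this setting is passed to the launcher when the context starts:

    # Sketch only: pulls hadoop-aws (and its transitive dependencies) at startup.
    # The app name is an arbitrary placeholder.
    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("s3-json-example")
            .set("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.1"))
    sc = SparkContext(conf=conf)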

    Then, in your job, you need to set your AWS credentials, for example:

    sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", aws_id)
    sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", aws_key)
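
    Putting this together with the original question, a minimal sketch of reading a JSON file from S3 (the bucket name, path, and credential values below are placeholders of mine, not from the question):

    # Sketch: credentials and S3 path are placeholders.
    from pyspark.sql import SparkSession

    aws_id = "YOUR_AWS_ACCESS_KEY_ID"        # placeholder
    aws_key = "YOUR_AWS_SECRET_ACCESS_KEY"   # placeholder

    spark = SparkSession.builder.appName("read-json-from-s3").getOrCreate()
    sc = spark.sparkContext
    sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", aws_id)
    sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", aws_key)

    # Read the JSON file through the s3n filesystem provided by hadoop-aws
    df = spark.read.json("s3n://your-bucket/path/to/file.json")
    df.printSchema()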
    

    Another option for setting your credentials is to define them in spark/conf/spark-env.sh:

    #!/usr/bin/env bash
    # export so the credentials reach the JVM processes Spark launches
    export AWS_ACCESS_KEY_ID='xxxx'
    export AWS_SECRET_ACCESS_KEY='xxxx'

    SPARK_WORKER_CORES=1 # to set the number of cores to use on this machine
    SPARK_WORKER_MEMORY=1g # to set how much total memory workers have to give executors (e.g. 1000m, 2g)
    SPARK_EXECUTOR_INSTANCES=10 # to set the number of worker processes per node
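
    With the variables exported in the driver's environment, one way (a sketch of mine, an assumption rather than part of the original answer) to avoid hard-coding the keys in the job is to copy them from the environment into the Hadoop configuration:

    # Sketch: reads the credentials from the environment instead of hard-coding them.
    import os

    from pyspark import SparkContext

    sc = SparkContext(appName="s3-env-credentials")
    hadoop_conf = sc._jsc.hadoopConfiguration()
    hadoop_conf.set("fs.s3n.awsAccessKeyId", os.environ["AWS_ACCESS_KEY_ID"])
    hadoop_conf.set("fs.s3n.awsSecretAccessKey", os.environ["AWS_SECRET_ACCESS_KEY"])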
    

    More info:

    • How to Run PySpark on AWS
    • AWS Credentials
