Question
I know I can output my Spark DataFrame to AWS S3 as a CSV file with
df.repartition(1).write.csv('s3://my-bucket-name/df_name')
My question is: is there an easy way to set the Access Control List (ACL) of this file to 'bucket-owner-full-control'
when writing it to S3 using PySpark?
Answer 1:
I don't know about the EMR S3 connector; in the ASF S3A connector you set the option fs.s3a.acl.default
when you open the connection. It cannot be set on a file-by-file basis.
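A minimal PySpark sketch of that approach, assuming the S3A connector is on the classpath (s3a:// paths); the app name, bucket and output path are placeholders. The option is passed with the spark.hadoop. prefix so it lands in the Hadoop configuration before any S3A connection is opened:

from pyspark.sql import SparkSession

# fs.s3a.acl.default must be in place before the S3A filesystem is first opened;
# the spark.hadoop. prefix copies the value into the Hadoop configuration.
spark = (
    SparkSession.builder
    .appName('YourAppName')
    .config('spark.hadoop.fs.s3a.acl.default', 'BucketOwnerFullControl')
    .getOrCreate()
)

# The ACL then applies to every object this connector writes, not per file.
df = spark.createDataFrame([(1, 'a'), (2, 'b')], ['id', 'value'])
df.repartition(1).write.csv('s3a://my-bucket-name/df_name')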
Answer 2:
The Access Control List (ACL) can be set via the Hadoop configuration after building the Spark session.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('YourAppName').getOrCreate()
Set the ACL as below (in PySpark the Hadoop configuration is reached through the underlying JVM context, sparkContext._jsc):
spark.sparkContext._jsc.hadoopConfiguration().set('fs.s3.canned.acl', 'BucketOwnerFullControl')
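For completeness, a hedged end-to-end sketch of this answer's approach; fs.s3.canned.acl is the EMR s3:// filesystem's property, and the bucket and path below are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('YourAppName').getOrCreate()

# EMRFS picks up the canned ACL from the Hadoop configuration at write time.
spark.sparkContext._jsc.hadoopConfiguration().set('fs.s3.canned.acl', 'BucketOwnerFullControl')

df = spark.createDataFrame([(1, 'a')], ['id', 'value'])
df.repartition(1).write.csv('s3://my-bucket-name/df_name')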
Reference: s3 documentation
Answer 3:
Ran into the exact same issue: a Spark job wrote files to a bucket with server-side encryption set, and accessing them resulted in Access Denied. After reading some blogs, I learned that this can be solved by setting the fs.s3a.acl.default parameter to BucketOwnerFullControl.
Here is the code:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("YourAppName").getOrCreate()
spark.sparkContext.hadoopConfiguration.set("fs.s3a.acl.default", "BucketOwnerFullControl")
Source: https://stackoverflow.com/questions/52673924/how-to-assign-the-access-control-list-acl-when-writing-a-csv-file-to-aws-in-py