Question
I am writing data into an S3 bucket and creating Parquet files using PySpark. My bucket structure looks like this:
s3a://rootfolder/subfolder/table/
The two folders subfolder and table should be created at run time if they do not exist; if they already exist, the Parquet files should go inside the table folder.
When I run the PySpark program from my local machine it creates an extra folder ending in _$folder$ (e.g. table_$folder$), but when the same program runs on EMR it creates a _SUCCESS file instead.
Writing into S3 (PySpark program):
data.write.parquet("s3a://rootfolder/sub_folder/table/", mode="overwrite")
Is there a way to create the folders in S3 only if they do not exist, and to avoid creating entries like table_$folder$ or _SUCCESS?
Answer 1:
The s3a connector (org.apache.hadoop.fs.s3a.S3AFileSystem) doesn't create $folder$ files. It generates directory markers as path + /. For example, mkdir s3a://bucket/a/b creates a zero-byte marker object /a/b/. The trailing slash differentiates it from a file, which would have the path /a/b.
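To make the distinction concrete, here is a minimal boto3 sketch, not part of the original answer: the bucket and prefix names are taken from the question, and it just lists what such a marker looks like next to real data objects.

import boto3  # assumes AWS credentials are already configured

s3 = boto3.client("s3")

# A directory marker created by the s3a connector shows up as a zero-byte
# object whose key ends with "/"; real data objects do not.
resp = s3.list_objects_v2(Bucket="rootfolder", Prefix="subfolder/table")
for obj in resp.get("Contents", []):
    kind = "marker" if obj["Key"].endswith("/") and obj["Size"] == 0 else "data"
    print(kind, obj["Key"], obj["Size"])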
- If, locally, you are using the s3n: URL: stop it. Use the s3a connector.
- If you have been setting the fs.s3a.impl option: stop it. Hadoop knows what to use, and it uses the S3AFileSystem class.
- If you are seeing them and you are running EMR, that's EMR's connector. Closed source, out of scope.
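As a sketch of that advice for a local run, a PySpark session might look like the following; only the bucket path comes from the question, the app name and credential placeholders are assumptions, and fs.s3a.impl is deliberately left unset.

from pyspark.sql import SparkSession

# Local-run sketch: write through s3a:// and do NOT set fs.s3a.impl --
# Hadoop already maps the s3a scheme to S3AFileSystem.
spark = (
    SparkSession.builder
    .appName("write-parquet-to-s3a")
    .config("spark.hadoop.fs.s3a.access.key", "<ACCESS_KEY>")
    .config("spark.hadoop.fs.s3a.secret.key", "<SECRET_KEY>")
    .getOrCreate()
)

data = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
data.write.parquet("s3a://rootfolder/subfolder/table/", mode="overwrite")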
Answer 2:
Generally, as was mentioned in the comments, on S3 everything is either a Bucket or an Object:
However, the folder structure is more a visual representation and not an actual hierarchy like in a traditional filesystem.
https://docs.aws.amazon.com/AmazonS3/latest/user-guide/using-folders.html
For this reason, you only have to create the bucket and don't need to create the folders; it will only fail if the bucket+key combination already exists.
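To illustrate, a small boto3 sketch (the key name and payload are made up for illustration; the bucket name is from the question): uploading an object under a slash-separated key is all it takes, with no separate "create folder" step.

import boto3  # assumes credentials are configured

s3 = boto3.client("s3")

# No mkdir step: putting an object under "subfolder/table/" is enough.
# The S3 console simply renders the slash-separated key prefixes as folders.
s3.put_object(
    Bucket="rootfolder",
    Key="subfolder/table/part-00000.parquet",  # illustrative key name
    Body=b"...",  # placeholder payload
)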
About the _$folder$ entries I'm not sure; I haven't seen those. It seems they are created by Hadoop:
https://aws.amazon.com/premiumsupport/knowledge-center/emr-s3-empty-files/
Junk Spark output file on S3 with dollar signs
How can I configure spark so that it creates "_$folder$" entries in S3?
About the _SUCCESS file: this basically indicates that your job completed successfully. You can disable it with:
sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
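Since the question is PySpark, here is a rough equivalent of that Scala line; either form is a common way to pass the same Hadoop option, and the session variable name is assumed.

# Set it on the underlying Hadoop configuration of an existing session ...
spark.sparkContext._jsc.hadoopConfiguration().set(
    "mapreduce.fileoutputcommitter.marksuccessfuljobs", "false"
)

# ... or pass it when building the session:
# SparkSession.builder.config(
#     "spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs", "false"
# )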
Source: https://stackoverflow.com/questions/65125955/dynamically-folder-creation-in-s3-bucket-from-pyspark-job