Dynamic folder creation in an S3 bucket from a PySpark job

Submitted by 与世无争的帅哥 on 2021-01-29 09:01:37

Question


I am writing data into an S3 bucket and creating Parquet files using PySpark. My bucket structure looks like this:

s3a://rootfolder/subfolder/table/

The two folders subfolder and table should be created at run time if they do not exist; if they already exist, the Parquet files should go inside the table folder.

When I run the PySpark program from my local machine it creates an extra folder ending in _$folder$ (like table_$folder$), but if the same program is run from EMR it creates a _SUCCESS file instead.

Writing into S3 (PySpark program):
 data.write.parquet("s3a://rootfolder/sub_folder/table/", mode="overwrite")
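
A slightly fuller sketch of the same write, assuming the bucket rootfolder exists and credentials come from the environment or an instance role; the DataFrame contents are made up for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("write-parquet-to-s3").getOrCreate()

# Toy DataFrame just for illustration.
data = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# S3 has no real directories: writing to this path creates the
# sub_folder/table/ prefix implicitly, so no mkdir step is needed.
data.write.parquet("s3a://rootfolder/sub_folder/table/", mode="overwrite")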

Is there a way to create the folder in S3 only if it does not exist, without creating entries like table_$folder$ or _SUCCESS?


Answer 1:


The s3a connector (org.apache.hadoop.fs.s3a.S3AFileSystem) doesn't create $folder$ files. It generates directory markers as path + /. For example, mkdir s3a://bucket/a/b creates a zero-byte marker object /a/b/. This differentiates it from a file, which would have the path /a/b.

  1. If, locally, you are using the s3n: URL: stop it. Use the s3a connector.
  2. If you have been setting the fs.s3a.impl option: stop it. Hadoop knows what to use, and it uses the S3AFileSystem class (a minimal configuration sketch follows this list).
  3. If you are seeing them and you are running EMR, that's EMR's connector. Closed source, out of scope.
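
A minimal PySpark configuration sketch along those lines, using standard fs.s3a.* options (the key values are placeholders, not from the original answer) and deliberately not setting fs.s3a.impl:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("s3a-write")
    # Credentials for the S3A connector; an instance profile or environment
    # variables work too. The values here are placeholders.
    .config("spark.hadoop.fs.s3a.access.key", "YOUR_ACCESS_KEY")
    .config("spark.hadoop.fs.s3a.secret.key", "YOUR_SECRET_KEY")
    # Note: fs.s3a.impl is intentionally NOT set; Hadoop already maps the
    # s3a:// scheme to org.apache.hadoop.fs.s3a.S3AFileSystem.
    .getOrCreate()
)

spark.range(10).write.parquet("s3a://rootfolder/sub_folder/table/", mode="overwrite")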



Answer 2:


Generally, as mentioned in the comments, on S3 everything is either a bucket or an object; the folder structure is more a visual representation than an actual hierarchy like in a traditional filesystem.
https://docs.aws.amazon.com/AmazonS3/latest/user-guide/using-folders.html
For this reason, you only have to create the buckets and don't need to create the folders. It will only fail if the bucket+key combination already exists.
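
As a small illustration of that point (not part of the original answer; it assumes boto3 is installed, credentials are configured, and it reuses the bucket name from the question):

import boto3

s3 = boto3.client("s3")

# No mkdir needed: uploading an object whose key starts with
# "subfolder/table/" makes that prefix show up as a "folder" in the console.
s3.put_object(Bucket="rootfolder", Key="subfolder/table/example.txt", Body=b"hello")

# The "folders" exist only as parts of the key string.
resp = s3.list_objects_v2(Bucket="rootfolder", Prefix="subfolder/table/")
print([obj["Key"] for obj in resp.get("Contents", [])])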

About the _$folder$ entries I'm not sure; I haven't seen those. It seems they are created by Hadoop: https://aws.amazon.com/premiumsupport/knowledge-center/emr-s3-empty-files/
Junk Spark output file on S3 with dollar signs
How can I configure spark so that it creates "_$folder$" entries in S3?

About the _SUCCESS file: this basically indicates that your job completed successfully. You can disable it with:

sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
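
That snippet is Scala; a PySpark equivalent (a sketch, assuming a standard SparkSession) sets the same Hadoop property either at session creation time or on an existing session:

from pyspark.sql import SparkSession

# Set the property when the session is built, via the spark.hadoop.* prefix...
spark = (
    SparkSession.builder
    .appName("no-success-marker")
    .config("spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
    .getOrCreate()
)

# ...or on an existing session through the underlying Hadoop configuration.
spark.sparkContext._jsc.hadoopConfiguration().set(
    "mapreduce.fileoutputcommitter.marksuccessfuljobs", "false"
)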


Source: https://stackoverflow.com/questions/65125955/dynamically-folder-creation-in-s3-bucket-from-pyspark-job
