Junk Spark output file on S3 with dollar signs

Submitted by 谁说胖子不能爱 on 2019-12-25 07:59:51

Question


I have a simple Spark job that reads a file from S3, takes the first five lines, and writes them back to S3. Every time it runs, an additional object called output_$folder$ appears in S3 next to my output "directory".

What is it, and how can I prevent Spark from creating it? Here is the code I am running:

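# `spark` is an existing SparkSession
x = spark.sparkContext.textFile("s3n://.../0000_part_00")
# take(5) returns a plain Python list on the driver
five = x.take(5)
# turn the list back into an RDD so it can be written out
five = spark.sparkContext.parallelize(five)
five.repartition(1).saveAsTextFile("s3n://prod.casumo.stu/dimensions/output/")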

After the job finishes, I have an S3 "directory" called output that contains the results, plus another S3 object called output_$folder$ whose purpose I don't understand.


Answer 1:


OK, it seems I found out what it is. It is a marker file, probably used to determine whether the S3 directory object exists. How did I reach this conclusion? First, I found this link, which shows the source of the

org.apache.hadoop.fs.s3native.NativeS3FileSystem#mkdir

method: http://apache-spark-user-list.1001560.n3.nabble.com/S3-Extra-folder-files-for-every-directory-node-td15078.html

Then I searched other source repositories to see whether I could find a different version of the method. I didn't.

Finally, I ran an experiment: I removed the S3 output directory object but left the output_$folder$ file, then reran the same Spark job. The job failed, saying that the output directory already exists.

My conclusion: this is Hadoop's way of knowing whether a directory with a given name exists in S3, and I will have to live with it.

All of the above happens when I run the job from my local dev machine, i.e. my laptop. If I run the same job from an AWS Data Pipeline, output_$folder$ does not get created.
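
For anyone who wants to see which marker objects a job has left behind, here is a minimal sketch that picks them out of a list of S3 keys by their suffix. The function name find_folder_markers and the sample keys are hypothetical, and the sketch assumes the key list has already been fetched with whatever S3 client you use; that step is not shown.

```python
# Suffix that the marker objects described in this answer carry.
FOLDER_MARKER_SUFFIX = "_$folder$"


def find_folder_markers(keys):
    """Return only the keys that look like directory marker objects.

    `keys` is any iterable of S3 object keys (plain strings).
    """
    return [key for key in keys if key.endswith(FOLDER_MARKER_SUFFIX)]


if __name__ == "__main__":
    # Hypothetical listing, shaped like the layout in the question.
    sample_keys = [
        "dimensions/output/part-00000",
        "dimensions/output/_SUCCESS",
        "dimensions/output_$folder$",
    ]
    print(find_folder_markers(sample_keys))
```

Keep the experiment above in mind before deleting anything this finds: the job treated a leftover marker as an existing output directory, so these objects are meaningful to the s3n filesystem.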




Answer 2:


Changing the S3 paths in the application from s3:// to s3a:// seems to have done the trick for me. The $folder$ files have not been created since I started using s3a://.
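
If the paths are scattered around a job, the scheme swap can be done in one place. This is a rough sketch: to_s3a is a hypothetical helper, not part of Spark or Hadoop, and it assumes the S3A connector (the hadoop-aws module and its AWS SDK dependency) is available to the Spark runtime. It also rewrites s3n://, the scheme used in the question.

```python
import re

# Matches a leading s3:// or s3n:// scheme; s3a:// is left untouched.
_OLD_S3_SCHEME = re.compile(r"^s3n?://")


def to_s3a(path):
    """Rewrite an s3:// or s3n:// path to use the s3a:// scheme."""
    return _OLD_S3_SCHEME.sub("s3a://", path)


if __name__ == "__main__":
    # Output path taken from the question.
    print(to_s3a("s3n://prod.casumo.stu/dimensions/output/"))
```

The rewritten path would then be passed to saveAsTextFile in place of the original one.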



Source: https://stackoverflow.com/questions/40041732/junk-spark-output-file-on-s3-with-dollar-signs
