Save a large Spark DataFrame as a single JSON file in S3

Question

I'm trying to save a Spark DataFrame (more than 20 GB) to a single JSON file in Amazon S3. My code to save the DataFrame looks like this:

dataframe.repartition(1).save("s3n://mybucket/testfile","json")

But I'm getting an error from S3: "Your proposed upload exceeds the maximum allowed size". I know that Amazon limits a single upload to 5 GB.

Is it possible to use S3 multipart upload with Spark, or is there another way to solve this?

By the way, I need the data in a single file because another user is going to download it afterwards.

*I'm using Apache Spark 1.3.1 on a 3-node cluster created with the spark-ec2 script.

Thanks a lot

Answer 1:

I would try splitting the large DataFrame into a series of smaller DataFrames and then appending each one to the same target:

df.write.mode('append').json(yourtargetpath)

Answer 2:

Try this:

dataframe.write.format("org.apache.spark.sql.json").mode(SaveMode.Append).save("hdfs://localhost:9000/sampletext.txt");

Answer 3:

I don't think s3a is production-ready in Spark, and I would say the design is not sound. repartition(1) is going to be terrible, because it tells Spark to merge all partitions into a single one. I would suggest convincing the downstream user to download the contents from a folder rather than a single file.
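If the downstream user still wants one file, the folder of part files can be stitched together after download. The following is only a minimal sketch, assuming the folder has already been copied to local disk, that Spark named its outputs part-*, and that every part ends with a newline (so JSON records do not run together). The function name merge_part_files and both paths are hypothetical, not part of Spark or S3.

```python
import glob
import os
import shutil


def merge_part_files(parts_dir, merged_path):
    """Concatenate Spark-style part files into one file, in name order.

    Hypothetical helper: assumes parts_dir is a local copy of the output folder.
    """
    # One "part-*" file per partition; sorting by name keeps partition order.
    # Marker files such as _SUCCESS do not match the pattern and are skipped.
    part_paths = sorted(glob.glob(os.path.join(parts_dir, "part-*")))
    with open(merged_path, "wb") as merged:
        for part_path in part_paths:
            with open(part_path, "rb") as part:
                # Stream in chunks so a multi-GB part is never held in memory.
                shutil.copyfileobj(part, merged)
    return len(part_paths)
```

Calling merge_part_files("testfile", "testfile.json") would then write the combined records to testfile.json and return the number of parts merged. This keeps the Spark job parallel and moves the single-file requirement to the consumer's side.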

Source: https://stackoverflow.com/questions/29908892/save-a-large-spark-dataframe-as-a-single-json-file-in-s3

Tags

apache-spark

dataframe

apache-spark-sql

pyspark
