Using Spark to write a parquet file to s3 over s3a is very slow

闹比i 2020-12-04 18:16

I'm trying to write a Parquet file out to Amazon S3 using Spark 1.6.1. The small Parquet file that I'm generating is …

4 Answers
  •  情书的邮戳
    2020-12-04 18:49

    The direct output committer is gone from the Spark codebase; you would have to write your own or resurrect the deleted code in your own JAR. If you do so, turn speculation off in your job, and be aware that other failures can cause problems too, where "problems" means invalid data.
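
    As a rough illustration of the speculation advice above, here is a minimal Spark 1.6.x (Scala) sketch; the app name, bucket and paths are placeholders, not anything taken from the question:

        import org.apache.spark.{SparkConf, SparkContext}
        import org.apache.spark.sql.SQLContext

        val conf = new SparkConf()
          .setAppName("parquet-to-s3a")        // placeholder app name
          // No atomic S3 committer exists here, so disable speculation to
          // avoid duplicate speculative tasks leaving partial output behind.
          .set("spark.speculation", "false")

        val sc = new SparkContext(conf)
        val sqlContext = new SQLContext(sc)

        // Placeholder input; any DataFrame source works.
        val df = sqlContext.read.json("hdfs:///staging/events.json")

        // Writing straight to an s3a:// path still goes through the slow
        // rename-based commit at the end of the job.
        df.write.parquet("s3a://your-bucket/warehouse/events/")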

    On a brighter note, Hadoop 2.8 is going to add some S3A speedups specifically for reading optimised binary formats (ORC, Parquet) off S3; see HADOOP-11694 for details. Some people are also working on using Amazon DynamoDB as a consistent metadata store, which should eventually allow a robust O(1) commit at the end of a job.
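
    If you do end up on Hadoop 2.8+, the read-side speedups are mostly a matter of configuration. A sketch of the kind of setting involved, reusing the sc from the snippet above; treat the value as illustrative rather than a complete tuning guide:

        // Hadoop 2.8+ only: hint the S3A input stream that access is random,
        // which suits columnar formats such as ORC and Parquet.
        sc.hadoopConfiguration.set("fs.s3a.experimental.input.fadvise", "random")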
