I'm trying to write a Parquet file out to Amazon S3 using Spark 1.6.1. The small Parquet file that I'm generating is
The direct output committer is gone from the Spark codebase; you would have to write your own, or resurrect the deleted code in your own JAR. If you do so, turn speculation off in your job, and be aware that failures other than speculation can cause problems too, where "problem" means invalid data.
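If you do go down that road, a minimal sketch of the relevant settings for a Spark 1.6.x job might look like the following. The committer class name is a placeholder for whatever you resurrect and package yourself, and the `spark.sql.parquet.output.committer.class` hook is the 1.6-era way of pointing Spark SQL at a custom Parquet committer:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch only, assuming Spark 1.6.x. Speculation must be off: a speculative
// duplicate of a task can race the original and leave partial or overwritten
// files when a direct committer skips the usual write-to-temp-then-rename step.
val conf = new SparkConf()
  .setAppName("parquet-to-s3")
  .set("spark.speculation", "false")
  // Placeholder class name: point this at the committer you ship in your own JAR.
  .set("spark.sql.parquet.output.committer.class",
       "com.example.DirectParquetOutputCommitter")

val sc = new SparkContext(conf)
```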
On a brighter note, Hadoop 2.8 is going to add some S3A speedups specifically for reading optimised binary formats (ORC, Parquet) off S3; see HADOOP-11694 for details. And some people are working on using Amazon DynamoDB as the consistent metadata store, which should eventually allow a robust O(1) commit at the end of the work.
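Once you can run against a Hadoop 2.8+ S3A client, the read-path tuning from that work is exposed as a filesystem option. As an illustrative sketch (property name as documented for the 2.8 S3A connector; not available on earlier Hadoop versions), you would switch the input policy to random IO for columnar formats:

```scala
import org.apache.spark.SparkConf

// Illustrative only: requires the Hadoop 2.8+ S3A connector on the classpath.
// The "random" input policy avoids closing and re-opening the HTTP connection
// on every backwards seek, which columnar ORC/Parquet readers do constantly.
val conf = new SparkConf()
  .set("spark.hadoop.fs.s3a.experimental.input.fadvise", "random")
```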