A Spark job runs every 2 hours to convert some tgz files to Parquet. The job appends the new data to an existing Parquet dataset in S3:

    df.write.mode("append").partitionBy("id","day").parquet("s3://myBucket/foo.parquet")

In the spark-submit output I can see that significant time is spent reading old Parquet files, for example:

    16/11/27 14:06:15 INFO S3NativeFileSystem: Opening 's3://myBucket/foo.parquet/id=123/day=2016-11-26/part-r-00003-b20752e9-5d70-43f5-b8b4-50b5b4d0c7da.snappy.parquet' for reading
    16/11/27 14:06:15 INFO S3NativeFileSystem: Stream for key 'foo.parquet/id=123/day=2016-11-26