I have been trying to fix this for a long time now ... no idea why I get this? FYI, I'm running Spark on an AWS EMR cluster. I debugged and clearly see the destin
I have seen a similar problem when writing Parquet files to S3. The problem is SaveMode.Overwrite, which doesn't seem to work correctly in combination with S3. Try deleting all the data in your S3 bucket my-bucket-name before writing into it; then your code should run successfully.
To delete all files from your bucket my-bucket-name, you can use the following PySpark code:
# see https://www.quora.com/How-do-you-overwrite-the-output-directory-when-using-PySpark
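# the JVM classes below are reached through the SparkContext's Py4J gateway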
URI = sc._gateway.jvm.java.net.URI
Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
FileSystem = sc._gateway.jvm.org.apache.hadoop.fs.FileSystem
# see http://crazyslate.com/how-to-rename-hadoop-files-using-wildcards-while-patterns/
fs = FileSystem.get(URI("s3a://my-bucket-name"), sc._jsc.hadoopConfiguration())
# list every entry under the bucket root and delete each one recursively
file_status = fs.globStatus(Path("/*"))
for status in file_status:
    fs.delete(status.getPath(), True)
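Once the bucket is empty, the overwrite should go through. As a rough sketch (df and the output path s3a://my-bucket-name/output are placeholders for your own DataFrame and destination), the write itself would look something like this:

# hypothetical example: df is your DataFrame, the path is your S3 destination
df.write.mode("overwrite").parquet("s3a://my-bucket-name/output")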