I use dynamic frames to write a parquet file to S3, but if a file already exists my program appends a new file instead of replacing it. The statement I use is this:
glueContext.write_dynamic_frame.from_options(
    frame = table,
    connection_type = "s3",
    connection_options = {"path": output_dir,
                          "partitionKeys": ["var1", "var2"]},
    format = "parquet")
Is there anything like "mode": "overwrite" that replaces my parquet files?
Currently AWS Glue doesn't support an 'overwrite' mode, but they are working on this feature.
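If you prefer to stay on the DynamicFrame writer, another possible workaround is to purge the target prefix before writing. glueContext.purge_s3_path is part of the Glue PySpark API (verify that it is available in your Glue version); note this mimics a full overwrite, since everything under output_dir is deleted. A minimal sketch, reusing table and output_dir from the question:
# Delete existing objects under the prefix; a retentionPeriod of 0
# removes files regardless of age (the default retains recent files).
glueContext.purge_s3_path(output_dir, options={"retentionPeriod": 0})

glueContext.write_dynamic_frame.from_options(
    frame=table,
    connection_type="s3",
    connection_options={"path": output_dir,
                        "partitionKeys": ["var1", "var2"]},
    format="parquet")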
As a workaround, you can convert the DynamicFrame object to a Spark DataFrame and write it using Spark instead of Glue:
(table.toDF()
    .write
    .mode("overwrite")
    .format("parquet")
    .partitionBy("var1", "var2")
    .save(output_dir))
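For context, here is a minimal end-to-end sketch of that workaround inside a Glue job; the catalog database and table names are hypothetical placeholders:
from pyspark.context import SparkContext
from awsglue.context import GlueContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Hypothetical source table; replace with your own catalog database/table.
table = glueContext.create_dynamic_frame.from_catalog(
    database="my_db",
    table_name="my_table")

output_dir = "s3://bucket/table_name"

# Converting to a Spark DataFrame exposes mode("overwrite"),
# which the DynamicFrame writer lacks.
(table.toDF()
    .write
    .mode("overwrite")
    .format("parquet")
    .partitionBy("var1", "var2")
    .save(output_dir))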
If you don't want your process to overwrite everything under "s3://bucket/table_name", you could use
spark.conf.set("spark.sql.sources.partitionOverwriteMode","dynamic")
(data.toDF()
    .write
    .mode("overwrite")
    .format("parquet")
    .partitionBy("date", "name")
    .save("s3://bucket/table_name"))
This will only update the "selected" partitions in that S3 location. In my case, I have 30 date-partitions in my DynamicFrame "data".
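Note that dynamic partition overwrite requires Spark 2.3 or later, so it is available on Glue 1.0 (Spark 2.4). Inside a Glue job, the spark session used above comes from the GlueContext; a minimal sketch:
spark = glueContext.spark_session

# Only partitions present in the DataFrame being written are replaced;
# other partitions under the path are left untouched.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")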
I'm using Glue 1.0 - Spark 2.4 - Python 2.
Source: https://stackoverflow.com/questions/52001781/overwrite-parquet-files-from-dynamic-frame-in-aws-glue