I use dynamic frames to write a parquet file to S3, but if a file already exists my program appends a new file instead of replacing it. The statement I use is this:
glueContext.write_dynamic_frame.from_options(
    frame = table,
    connection_type = "s3",
    connection_options = {"path": output_dir,
                          "partitionKeys": ["var1", "var2"]},
    format = "parquet")
Is there anything like "mode": "overwrite" that replaces my parquet files?
Currently AWS Glue doesn't support an 'overwrite' mode, but they are working on this feature.
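If you prefer to stay on the DynamicFrame writer, another possible workaround is to purge the target prefix before writing. glueContext.purge_s3_path is part of the Glue PySpark API (verify that it is available in your Glue version); note this mimics a full overwrite, since everything under output_dir is deleted. A minimal sketch, reusing table and output_dir from the question:
# Delete existing objects under the prefix; a retentionPeriod of 0
# removes files regardless of age (the default retains recent files).
glueContext.purge_s3_path(output_dir, options={"retentionPeriod": 0})

glueContext.write_dynamic_frame.from_options(
    frame=table,
    connection_type="s3",
    connection_options={"path": output_dir,
                        "partitionKeys": ["var1", "var2"]},
    format="parquet")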
As a workaround, you can convert the DynamicFrame object to a Spark DataFrame and write it using Spark instead of Glue:
(table.toDF()
    .write
    .mode("overwrite")
    .format("parquet")
    .partitionBy("var1", "var2")
    .save(output_dir))
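For context, here is a minimal end-to-end sketch of that workaround inside a Glue job; the catalog database and table names are hypothetical placeholders:
from pyspark.context import SparkContext
from awsglue.context import GlueContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Hypothetical source table; replace with your own catalog database/table.
table = glueContext.create_dynamic_frame.from_catalog(
    database="my_db",
    table_name="my_table")

output_dir = "s3://bucket/table_name"

# Converting to a Spark DataFrame exposes mode("overwrite"),
# which the DynamicFrame writer lacks.
(table.toDF()
    .write
    .mode("overwrite")
    .format("parquet")
    .partitionBy("var1", "var2")
    .save(output_dir))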
If you don't want your process to overwrite everything under "s3://bucket/table_name", you could use
spark.conf.set("spark.sql.sources.partitionOverwriteMode","dynamic")
(data.toDF()
    .write
    .mode("overwrite")
    .format("parquet")
    .partitionBy("date", "name")
    .save("s3://bucket/table_name"))
This will only update the "selected" partitions in that S3 location. In my case, I have 30 date-partitions in my DynamicFrame "data".
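Note that dynamic partition overwrite requires Spark 2.3 or later, so it is available on Glue 1.0 (Spark 2.4). Inside a Glue job, the spark session used above comes from the GlueContext; a minimal sketch:
spark = glueContext.spark_session

# Only partitions present in the DataFrame being written are replaced;
# other partitions under the path are left untouched.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")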
I'm using Glue 1.0 - Spark 2.4 - Python 2.
Source: https://stackoverflow.com/questions/52001781/overwrite-parquet-files-from-dynamic-frame-in-aws-glue