AWS Glue Bookmark produces duplicates

Posted by 心不动则不痛 on 2019-12-11 18:33:16

Question


I am submitting a Python script (PySpark, actually) to a Glue job to process Parquet files and extract some analytics from this data source.

These Parquet files live in an S3 folder that continuously grows with new data. I was happy with the bookmarking logic provided by AWS Glue because it helps a lot: it lets us process only new data without reprocessing data that has already been processed.

Unfortunately, in this scenario I notice that duplicates are produced on every run, and it looks like AWS Glue bookmarking is not working at all. What is the reason for this unexpected behaviour?


Answer 1:


Can you please check again now? Bookmarking supports Parquet and ORC, but only on Glue version 1.0 and later. Version 0.9 did not support them.

https://docs.aws.amazon.com/glue/latest/dg/monitor-continuations.html




Answer 2:


From https://docs.aws.amazon.com/glue/latest/dg/monitor-continuations.html

The Apache Parquet and ORC formats are currently not supported.

UPDATE

As of Jul 26 2019, AWS Glue bookmarking supports the Parquet and ORC formats as well.

https://docs.aws.amazon.com/glue/latest/dg/monitor-continuations.html
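
For anyone still seeing duplicates on Glue 1.0 or later, here is a minimal sketch of a bookmark-aware Parquet read. It is not the asker's script. The bucket path `s3://example-bucket/events/` and the context name `read_events` are hypothetical placeholders. It assumes the job is started with `--job-bookmark-option job-bookmark-enable`, that the source is read with a `transformation_ctx`, and that the script calls `job.init()` at the start and `job.commit()` at the end, which as far as I know is what bookmarks need in order to save their state.

```python
import sys


def read_new_parquet(glue_context, s3_path, ctx_name):
    """Read Parquet files from S3 as a DynamicFrame.

    The transformation_ctx is the key under which Glue stores bookmark
    state for this source; without it the read is not bookmarked.
    """
    return glue_context.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={"paths": [s3_path]},
        format="parquet",
        transformation_ctx=ctx_name,
    )


def run_job():
    # These modules exist only inside the Glue runtime, so they are
    # imported here rather than at module level.
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())

    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)  # loads the previous bookmark state

    # Hypothetical path and context name, for illustration only.
    frame = read_new_parquet(
        glue_context, "s3://example-bucket/events/", "read_events"
    )
    print("records in this run:", frame.count())

    job.commit()  # persists the bookmark; skipping this reprocesses old files
```

In an actual job script, `run_job()` would be called as the last line. Note that reading the same path with `spark.read.parquet(...)` instead of a DynamicFrame bypasses bookmarking, so that is also worth checking.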



Source: https://stackoverflow.com/questions/55374622/aws-glue-bookmark-produces-duplicates
