How to ignore Amazon Athena struct order

荒凉一梦, submitted on 2019-12-02 10:14:18

I suggest you stop using Glue crawlers. It's probably not the answer you had hoped for, but crawlers are really bad at their job. They can be useful as a way to get a schema out of a random heap of data that someone else produced, when you don't want to spend time working out the schema by hand. But once you have a schema, and you know that new data will follow it, Glue crawlers just get in the way and create unnecessary problems like the one you have encountered.

What to do instead depends on how new data is added to S3.

If you are in control of the code that produces the data, you can have it add the partition right after the data has been uploaded. The benefit of this solution is that partitions are added as soon as new data has been produced, so tables are always up to date. The downside is that it may couple the data-producing code to Glue (or to Athena, if you prefer to add partitions through SQL) more tightly than you would like.
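As a rough sketch of the SQL route, assuming a table partitioned by a single `dt` key with Hive-style prefixes, something like the following could run right after the upload. The table, database, and bucket names and the helpers `add_partition_sql` and `register_partition` are made up for illustration; `athena` is assumed to be a `boto3.client("athena")`, and `output_location` an S3 prefix you have set aside for query results.

```python
from typing import Mapping


def add_partition_sql(table: str, partition: Mapping[str, str], location: str) -> str:
    """Build an ALTER TABLE ... ADD IF NOT EXISTS PARTITION statement."""
    for text in [location, *partition.values()]:
        # Keep the sketch simple: refuse values that would need quoting rules.
        if "'" in text:
            raise ValueError("single quotes are not supported: {!r}".format(text))
    spec = ", ".join("{} = '{}'".format(k, v) for k, v in partition.items())
    return "ALTER TABLE {} ADD IF NOT EXISTS PARTITION ({}) LOCATION '{}'".format(
        table, spec, location
    )


def register_partition(athena, database, table, partition, location, output_location):
    """Queue the statement in Athena and return the query execution id.

    `athena` is assumed to be boto3.client("athena"). The call only starts
    the query; it does not wait for it to finish.
    """
    response = athena.start_query_execution(
        QueryString=add_partition_sql(table, partition, location),
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]
```

Because of `IF NOT EXISTS`, running the statement again for a partition that is already registered should be harmless, so the producer can call it after every upload, for example `register_partition(athena, "example_db", "events", {"dt": "2019-12-01"}, "s3://example-bucket/events/dt=2019-12-01/", "s3://example-bucket/athena-results/")`.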

If it doesn't make sense to add the partitions from the code that produces the data, you can create a Lambda function that does it. You can run it on a schedule, for example at a fixed time every day: if you know where the new data will land you don't have to wait until it exists, because partitions can point to empty locations. Or you can trigger it with S3 notifications. If each partition receives multiple files, you can either debounce the notifications through SQS, or simply create the partition again for every file and swallow the error when it already exists.
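Here is a minimal sketch of the notification-triggered variant, using the Glue API and the "create it again and swallow the error" approach. It assumes Hive-style keys such as `events/dt=2019-12-01/part-0.json` whose `key=value` segments appear in the same order as the table's partition keys. `DATABASE`, `TABLE`, and the helper names are hypothetical, and `glue` is assumed to be a `boto3.client("glue")`.

```python
import copy
from urllib.parse import unquote_plus

# Hypothetical names for illustration; replace them with your own.
DATABASE = "example_db"
TABLE = "events"


def partition_from_key(key: str):
    """Return (partition values, prefix) for a Hive-style object key."""
    parts = key.split("/")[:-1]  # drop the file name
    values = [p.split("=", 1)[1] for p in parts if "=" in p]
    return values, "/".join(parts) + "/"


def add_partition(glue, bucket: str, key: str) -> bool:
    """Create the partition for one object; return False if it already existed."""
    values, prefix = partition_from_key(key)
    table = glue.get_table(DatabaseName=DATABASE, Name=TABLE)["Table"]
    # Reuse the table's storage descriptor (columns, SerDe, formats) and
    # change only the location, so the partition carries the table's schema.
    descriptor = copy.deepcopy(table["StorageDescriptor"])
    descriptor["Location"] = "s3://{}/{}".format(bucket, prefix)
    try:
        glue.create_partition(
            DatabaseName=DATABASE,
            TableName=TABLE,
            PartitionInput={"Values": values, "StorageDescriptor": descriptor},
        )
    except glue.exceptions.AlreadyExistsException:
        return False  # another file in the same partition got here first
    return True


def handler(event, context):
    """Lambda entry point for S3 event notifications."""
    import boto3  # imported here so the helpers above can be tested without AWS

    glue = boto3.client("glue")
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])  # keys arrive URL-encoded
        add_partition(glue, bucket, key)
```

This version calls `get_table` for every file, which is fine for a sketch; with many notifications per partition you would probably want the SQS debouncing mentioned above, or at least cache the table description between invocations.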

You may also have heard of MSCK REPAIR TABLE …. It's better than Glue crawlers in some ways, but just as bad in others. It only adds new partitions and never changes the schema, which is usually what you want. But it's extremely inefficient, and it runs slower and slower the more files there are. Kind of like Glue crawlers.
