AWS Glue Crawler adding tables for every partition?

Submitted by 我是研究僧i on 2020-01-31 08:28:16

Question


I have several thousand files in an S3 bucket in this form:

├── bucket
│   ├── somedata
│   │   ├── year=2016
│   │   ├── year=2017
│   │   │   ├── month=11
│   │   │   │   ├── sometype-2017-11-01.parquet
│   │   │   │   ├── sometype-2017-11-02.parquet
│   │   │   │   ├── ...
│   │   │   ├── month=12
│   │   │   │   ├── sometype-2017-12-01.parquet
│   │   │   │   ├── sometype-2017-12-02.parquet
│   │   │   │   ├── ...
│   │   ├── year=2018
│   │   │   ├── month=01
│   │   │   │   ├── sometype-2018-01-01.parquet
│   │   │   │   ├── sometype-2018-01-02.parquet
│   │   │   │   ├── ...
│   ├── moredata
│   │   ├── year=2017
│   │   │   ├── month=11
│   │   │   │   ├── moretype-2017-11-01.parquet
│   │   │   │   ├── moretype-2017-11-02.parquet
│   │   │   │   ├── ...
│   │   ├── year=...

etc

Expected behavior: The AWS Glue Crawler creates one table for each of somedata, moredata, etc. It creates partitions for each table based on the children's path names.

Actual behavior: The AWS Glue Crawler performs the behavior above, but ALSO creates a separate table for every partition of the data, resulting in several hundred extraneous tables (and more extraneous tables with every new data addition and crawl).

I don't see any setting that would prevent this from happening. Does anyone have advice on the best way to stop these unnecessary tables from being created?


Answer 1:


Adding the following exclude patterns worked for me (see the AWS Glue docs for add-crawler):

  • **_SUCCESS
  • **crc

The double stars match files at all folder (i.e. partition) depths. I had a _SUCCESS file living a few levels up.
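For reference, a minimal boto3 sketch of this approach; the crawler name, IAM role, database, and bucket path below are placeholders, not values from the question:

import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="somedata-crawler",                 # hypothetical crawler name
    Role="AWSGlueServiceRole-somedata",      # hypothetical IAM role
    DatabaseName="somedatabase",             # hypothetical Glue database
    Targets={
        "S3Targets": [
            {
                "Path": "s3://bucket/somedata/",
                # Double stars match files at every partition depth.
                "Exclusions": ["**_SUCCESS", "**crc"],
            }
        ]
    },
)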

Make sure you set up logging for Glue, which quickly points out permission errors, etc.




Answer 2:


I was having the same problem. I added *crc* as an exclude pattern to the AWS Glue crawler and it worked. Alternatively, if you crawl entire directories, add */*crc*.
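If the crawler already exists, a hedged boto3 sketch of applying the same fix looks roughly like this (crawler name and path are placeholders; note that Targets replaces the crawler's existing targets):

import boto3

glue = boto3.client("glue")

glue.update_crawler(
    Name="somedata-crawler",                 # hypothetical existing crawler
    Targets={
        "S3Targets": [
            {
                "Path": "s3://bucket/somedata/",
                "Exclusions": ["*crc*"],     # or "*/*crc*" for whole directories
            }
        ]
    },
)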




Answer 3:


Check whether you have empty folders inside. When Spark writes to S3, the _temporary folder is sometimes not deleted, which will make the Glue crawler create a table for each partition.
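A rough boto3 sketch for spotting leftover Spark _temporary objects (and zero-byte "folder" markers) under a prefix; the bucket and prefix are placeholders matching the layout in the question:

import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

leftovers = []
for page in paginator.paginate(Bucket="bucket", Prefix="somedata/"):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        # Flag Spark's temporary output and empty "folder" marker objects.
        if "/_temporary/" in key or (key.endswith("/") and obj["Size"] == 0):
            leftovers.append(key)

print("\n".join(leftovers))  # review the list before deleting anything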




Answer 4:


You need to have separate crawlers for each table / file type. So create one crawler that looks at s3://bucket/somedata/ and a second crawler that looks at s3://bucket/moredata/.
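A minimal sketch of that "one crawler per dataset" setup with boto3; the role and database names are placeholders:

import boto3

glue = boto3.client("glue")

for prefix in ["somedata", "moredata"]:
    glue.create_crawler(
        Name=f"{prefix}-crawler",
        Role="AWSGlueServiceRole-example",   # hypothetical IAM role
        DatabaseName="somedatabase",         # hypothetical Glue database
        Targets={"S3Targets": [{"Path": f"s3://bucket/{prefix}/"}]},
    )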




Answer 5:


My case was a little bit different, but I was seeing the same behaviour.

I have a data structure like this:

├── bucket
│   ├── somedata
│   │   ├── event_date=2016-01-01
│   │   ├── event_date=2016-01-02

When I ran the AWS Glue Crawler, instead of updating the tables, the pipeline was creating one table per date. After digging into the problem I found that someone had, by mistake, added a column to the JSON file as ID instead of id. Because my data is Parquet, the pipeline worked fine for storing the data and retrieving it inside EMR. But Glue was crashing badly, probably because Glue converts everything to lowercase. After removing the uppercase column, Glue started to work like a charm.
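A small sketch for catching this kind of issue up front: it checks one Parquet file for column names that collide after lowercasing (e.g. "id" vs "ID"). It assumes pyarrow is installed and a local copy of one of the files; the file name is a placeholder:

from collections import Counter

import pyarrow.parquet as pq

schema = pq.read_schema("sometype-2017-11-01.parquet")   # placeholder file name
counts = Counter(name.lower() for name in schema.names)
collisions = [name for name, n in counts.items() if n > 1]

if collisions:
    print("Columns that collide after lowercasing:", collisions)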



Source: https://stackoverflow.com/questions/48373084/aws-glue-crawler-adding-tables-for-every-partition
