AWS Glue Crawler Classifies json file as UNKNOWN

Submitted by 折月煮酒 on 2020-01-11 02:49:26

Question


I'm working on an ETL job that will ingest JSON files into an RDS staging table. The crawler I've configured classifies JSON files without issue as long as they are under 1 MB in size. If I minify a file (instead of pretty printing it), the crawler will also classify it without issue, provided the result is under 1 MB.

I'm having trouble coming up with a workaround. I tried converting the JSON to BSON and gzipping the JSON file, but it is still classified as UNKNOWN.

Has anyone else run into this issue? Is there a better way to do this?


Answer 1:


I have two JSON files, 42 MB and 16 MB, partitioned on S3 under these paths:

  • s3://bucket/stg/year/month/_0.json

  • s3://bucket/stg/year/month/_1.json

I had the same problem as you, crawler classification as UNKNOWN.

I was able to solve it:

  • Create a custom classifier with the JSON path "$[*]", then create a new crawler that uses the classifier.
  • Run the new crawler against the data on S3 and the proper schema will be created.
  • DO NOT just attach the classifier to your current crawler; the change won't be applied. I don't know why, maybe because of the classifier versioning AWS mentions in its documentation. Creating a new crawler makes it work.
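The steps above can be scripted with boto3 (a sketch; the classifier and crawler names, IAM role, database name, and S3 path below are placeholders, and it assumes an IAM role with Glue and S3 access already exists):

```python
# Parameters for the custom classifier and the new crawler.
# All names below are placeholders; adjust for your environment.
CLASSIFIER = {
    "Name": "my-json-classifier",
    "JsonPath": "$[*]",  # treat each top-level array element as a record
}

CRAWLER = {
    "Name": "my-new-crawler",
    "Role": "MyGlueServiceRole",          # IAM role with Glue + S3 access
    "DatabaseName": "staging_db",
    "Targets": {"S3Targets": [{"Path": "s3://bucket/stg/"}]},
    "Classifiers": [CLASSIFIER["Name"]],  # attach the custom classifier
}

def create_and_run():
    """Create the classifier, then a NEW crawler that uses it, and start it."""
    import boto3

    glue = boto3.client("glue")
    glue.create_classifier(JsonClassifier=CLASSIFIER)
    glue.create_crawler(**CRAWLER)        # a new crawler, not an update
    glue.start_crawler(Name=CRAWLER["Name"])
```

The key detail is that the classifier is attached at crawler creation time, matching the "create a new crawler" advice above.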



Answer 2:


As mentioned in

https://docs.aws.amazon.com/glue/latest/dg/custom-classifier.html#custom-classifier-json

When you run a crawler using the built-in JSON classifier, the entire file is used to define the schema. Because you don’t specify a JSON path, the crawler treats the data as one object, that is, just an array.
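To illustrate the difference (a toy sketch, not Glue itself): with the built-in classifier, a file whose top level is an array is treated as one object, while the "$[*]" JSON path makes each element its own record, so column types like id and name can be inferred.

```python
import json

raw = '[{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]'
data = json.loads(raw)

# Built-in classifier behaviour: the entire file defines the schema,
# so the whole thing is a single record (just an array).
as_one_object = data

# "$[*]" behaviour: each top-level array element is its own record,
# giving the crawler rows to infer columns from.
as_rows = [elem for elem in data]

print(len(as_rows))  # 2 records instead of 1 array
```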

That is something Dung also pointed out in his answer.



Source: https://stackoverflow.com/questions/46936721/aws-glue-crawler-classifies-json-file-as-unknown
