Question
I'm working on an ETL job that will ingest JSON files into an RDS staging table. The crawler I've configured classifies JSON files without issue as long as they are under 1 MB in size. If I minify a file (instead of pretty-printing it), it will also classify the file without issue as long as the result is under 1 MB.
I'm having trouble coming up with a workaround. I tried converting the JSON to BSON or gzipping the JSON file, but it is still classified as UNKNOWN.
Has anyone else run into this issue? Is there a better way to do this?
Answer 1:
I have two JSON files, 42 MB and 16 MB, partitioned on S3 under these paths:
s3://bucket/stg/year/month/_0.json
s3://bucket/stg/year/month/_1.json
I had the same problem as you: the crawler classified the files as UNKNOWN.
I was able to solve it:
- Create a custom classifier with the JSON path "$[*]", then create a new crawler that uses that classifier (see the sketch after this list).
- Run the new crawler against the data on S3 and the proper schema will be created.
- DO NOT just add the classifier to your current crawler, as it won't apply the change. I don't know why; maybe because of the classifier versioning AWS mentions in their documentation. Creating a new crawler made it work.
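For reference, here is a minimal boto3 sketch of those steps. The classifier, crawler, role, database, and S3 path names are hypothetical placeholders, not values from the original answer:

```python
import boto3

glue = boto3.client("glue")  # region/credentials come from your environment

# Hypothetical names -- replace with your own.
CLASSIFIER_NAME = "json-array-classifier"
CRAWLER_NAME = "stg-json-crawler"
CRAWLER_ROLE = "arn:aws:iam::123456789012:role/GlueCrawlerRole"
DATABASE_NAME = "staging"
S3_PATH = "s3://bucket/stg/"

# 1. Custom JSON classifier with JsonPath "$[*]" so each element of the
#    top-level array is treated as a record.
glue.create_classifier(
    JsonClassifier={"Name": CLASSIFIER_NAME, "JsonPath": "$[*]"}
)

# 2. A brand-new crawler that uses the classifier (per the answer above,
#    attaching the classifier to an existing crawler did not take effect).
glue.create_crawler(
    Name=CRAWLER_NAME,
    Role=CRAWLER_ROLE,
    DatabaseName=DATABASE_NAME,
    Targets={"S3Targets": [{"Path": S3_PATH}]},
    Classifiers=[CLASSIFIER_NAME],
)

# 3. Run the new crawler so it builds the schema from the S3 data.
glue.start_crawler(Name=CRAWLER_NAME)
```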
Answer 2:
As mentioned in
https://docs.aws.amazon.com/glue/latest/dg/custom-classifier.html#custom-classifier-json
When you run a crawler using the built-in JSON classifier, the entire file is used to define the schema. Because you don’t specify a JSON path, the crawler treats the data as one object, that is, just an array.
That is something Dung also pointed out in his answer.
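To illustrate the difference the "$[*]" path makes, here is a small sketch with made-up sample data (the field names are purely illustrative):

```python
import json

# Hypothetical sample mirroring the layout the crawler sees:
# the whole file is one top-level JSON array of objects.
sample = '[{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]'

# Without a JSON path, the entire document is treated as a single
# object -- here, just one array -- so no useful column schema emerges.
whole_document = json.loads(sample)
print(type(whole_document))      # <class 'list'>

# The "$[*]" path selects each array element as its own record,
# so the element keys ("id", "name") can become table columns.
for record in whole_document:    # what "$[*]" selects
    print(record.keys())         # dict_keys(['id', 'name'])
```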
Source: https://stackoverflow.com/questions/46936721/aws-glue-crawler-classifies-json-file-as-unknown