Question
I'm working on an ETL job that will ingest JSON files into an RDS staging table. The crawler I've configured classifies JSON files without issue as long as they are under 1 MB in size. If I minify a file (instead of pretty-printing it), it will also classify the file without issue as long as the result is under 1 MB.
I'm having trouble coming up with a workaround. I tried converting the JSON to BSON or gzipping the JSON file, but it is still classified as UNKNOWN.
Has anyone else run into this issue? Is there a better way to do this?
Answer 1:
I have two JSON files, 42 MB and 16 MB, partitioned on S3 under these paths:
s3://bucket/stg/year/month/_0.json
s3://bucket/stg/year/month/_1.json
I had the same problem as you: the crawler classified the files as UNKNOWN.
I was able to solve it:
- Create a custom classifier with the JSON path "$[*]", then create a new crawler that uses that classifier (see the sketch after this list).
- Run the new crawler against the data on S3 and the proper schema will be created.
- DO NOT just add the classifier to your current crawler, as it won't apply the change. I don't know why; maybe because of the classifier versioning AWS mentions in their documentation. Creating a new crawler made it work.
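For reference, here is a minimal boto3 sketch of those steps. The classifier, crawler, role, database, and S3 path names are hypothetical placeholders, not values from the original answer:

```python
import boto3

glue = boto3.client("glue")  # region/credentials come from your environment

# Hypothetical names -- replace with your own.
CLASSIFIER_NAME = "json-array-classifier"
CRAWLER_NAME = "stg-json-crawler"
CRAWLER_ROLE = "arn:aws:iam::123456789012:role/GlueCrawlerRole"
DATABASE_NAME = "staging"
S3_PATH = "s3://bucket/stg/"

# 1. Custom JSON classifier with JsonPath "$[*]" so each element of the
#    top-level array is treated as a record.
glue.create_classifier(
    JsonClassifier={"Name": CLASSIFIER_NAME, "JsonPath": "$[*]"}
)

# 2. A brand-new crawler that uses the classifier (per the answer above,
#    attaching the classifier to an existing crawler did not take effect).
glue.create_crawler(
    Name=CRAWLER_NAME,
    Role=CRAWLER_ROLE,
    DatabaseName=DATABASE_NAME,
    Targets={"S3Targets": [{"Path": S3_PATH}]},
    Classifiers=[CLASSIFIER_NAME],
)

# 3. Run the new crawler so it builds the schema from the S3 data.
glue.start_crawler(Name=CRAWLER_NAME)
```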
Answer 2:
As mentioned in
https://docs.aws.amazon.com/glue/latest/dg/custom-classifier.html#custom-classifier-json
When you run a crawler using the built-in JSON classifier, the entire file is used to define the schema. Because you don’t specify a JSON path, the crawler treats the data as one object, that is, just an array.
That is something Dung also pointed out in his answer.
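To illustrate the difference the "$[*]" path makes, here is a small sketch with made-up sample data (the field names are purely illustrative):

```python
import json

# Hypothetical sample mirroring the layout the crawler sees:
# the whole file is one top-level JSON array of objects.
sample = '[{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]'

# Without a JSON path, the entire document is treated as a single
# object -- here, just one array -- so no useful column schema emerges.
whole_document = json.loads(sample)
print(type(whole_document))      # <class 'list'>

# The "$[*]" path selects each array element as its own record,
# so the element keys ("id", "name") can become table columns.
for record in whole_document:    # what "$[*]" selects
    print(record.keys())         # dict_keys(['id', 'name'])
```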
Source: https://stackoverflow.com/questions/46936721/aws-glue-crawler-classifies-json-file-as-unknown