AWS Glue: How to handle nested JSON with varying schemas

Asked by 独厮守ぢ on 2021-01-31 11:29

Objective: We're hoping to use the AWS Glue Data Catalog to create a single table for JSON data residing in an S3 bucket, which we would then query and parse.

5 Answers
  •  时光取名叫无心
    Answered 2021-01-31 11:53

    I'm not sure you can do this with a table definition alone, but you can accomplish it with an ETL job by using a mapping function to cast the top-level values to JSON strings. Documentation: [link]

    import json
    
    from awsglue.context import GlueContext
    from awsglue.transforms import Map
    from pyspark.context import SparkContext
    
    glueContext = GlueContext(SparkContext.getOrCreate())
    
    # Mapping function: serialize every top-level value to a JSON string,
    # so records with differing nested schemas share one flat string schema
    def flatten(rec):
        for key in rec:
            rec[key] = json.dumps(rec[key])
        return rec
    
    # Read the raw JSON from S3 into a DynamicFrame
    old_df = glueContext.create_dynamic_frame.from_options(
        's3',
        {"paths": ['s3://...']},
        "json")
    
    # Apply mapping function f to all DynamicRecords in the DynamicFrame
    new_df = Map.apply(frame=old_df, f=flatten)
    

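    To see what the mapping function actually does to records whose nested schemas differ, here is a minimal stdlib-only sketch of the same idea, run on hypothetical sample records (no Glue required):

```python
import json

# Same idea as the answer's mapping function: serialize every top-level
# value to a JSON string so all records end up with one flat schema of
# string-typed columns. The sample records below are hypothetical.
def flatten(rec):
    for key in rec:
        rec[key] = json.dumps(rec[key])
    return rec

records = [
    {"id": 1, "payload": {"a": 1, "b": [2, 3]}},
    {"id": 2, "payload": {"c": "x"}},  # different nested schema
]
flat = [flatten(dict(r)) for r in records]

# Every value is now a plain string, regardless of how it was nested:
assert all(isinstance(v, str) for r in flat for v in r.values())
```

    Because every column is now a string, a single catalog table definition can cover all the records, at the cost of parsing the JSON back out at query time.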
    From here you have the option of exporting to S3 (ideally in Parquet or another columnar format to optimize for querying) or, as I understand it, loading directly into Redshift, although I haven't tried the latter.
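    On the consuming side, each stringified column then gets parsed back out at query time; in Athena or Redshift Spectrum that would be done with JSON functions such as `json_extract_scalar`. A pure-Python analogue on a hypothetical row read back from the flattened table:

```python
import json

# Hypothetical row as it would come back from the flattened table:
# every column is a JSON string, parsed on demand by the consumer.
row = {"id": "2", "payload": "{\"c\": \"x\"}"}

payload = json.loads(row["payload"])  # recover the nested structure
assert payload["c"] == "x"
assert json.loads(row["id"]) == 2
```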
