AWS Glue: How to handle nested JSON with varying schemas

Asked by 独厮守ぢ on 2021-01-31 11:29

Objective: We're hoping to use the AWS Glue Data Catalog to create a single table for JSON data residing in an S3 bucket, which we would then query and parse.

5 Answers
  •  时光取名叫无心
    Answered 2021-01-31 11:53

    I'm not sure you can do this with a table definition alone, but you can accomplish it with an ETL job by using a mapping function to cast the top-level values to JSON strings. Documentation: [link]

    import json
    
    from awsglue.context import GlueContext
    from awsglue.transforms import Map
    from pyspark.context import SparkContext
    
    glueContext = GlueContext(SparkContext.getOrCreate())
    
    # Mapping function: serialize every top-level value to a JSON string,
    # so records with differing nested schemas share one flat string schema
    def flatten(rec):
        for key in rec:
            rec[key] = json.dumps(rec[key])
        return rec
    
    # Read the raw JSON from S3 into a DynamicFrame
    old_df = glueContext.create_dynamic_frame.from_options(
        's3',
        {"paths": ['s3://...']},
        "json")
    
    # Apply mapping function f to all DynamicRecords in the DynamicFrame
    new_df = Map.apply(frame=old_df, f=flatten)
    

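    To see what the mapping function actually does to records whose nested schemas differ, here is a minimal stdlib-only sketch of the same idea, run on hypothetical sample records (no Glue required):

```python
import json

# Same idea as the answer's mapping function: serialize every top-level
# value to a JSON string so all records end up with one flat schema of
# string-typed columns. The sample records below are hypothetical.
def flatten(rec):
    for key in rec:
        rec[key] = json.dumps(rec[key])
    return rec

records = [
    {"id": 1, "payload": {"a": 1, "b": [2, 3]}},
    {"id": 2, "payload": {"c": "x"}},  # different nested schema
]
flat = [flatten(dict(r)) for r in records]

# Every value is now a plain string, regardless of how it was nested:
assert all(isinstance(v, str) for r in flat for v in r.values())
```

    Because every column is now a string, a single catalog table definition can cover all the records, at the cost of parsing the JSON back out at query time.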
    From here you have the option of exporting to S3 (ideally in Parquet or another columnar format to optimize for querying) or, as I understand it, loading directly into Redshift, although I haven't tried the latter.
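    On the consuming side, each stringified column then gets parsed back out at query time; in Athena or Redshift Spectrum that would be done with JSON functions such as `json_extract_scalar`. A pure-Python analogue on a hypothetical row read back from the flattened table:

```python
import json

# Hypothetical row as it would come back from the flattened table:
# every column is a JSON string, parsed on demand by the consumer.
row = {"id": "2", "payload": "{\"c\": \"x\"}"}

payload = json.loads(row["payload"])  # recover the nested structure
assert payload["c"] == "x"
assert json.loads(row["id"]) == 2
```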
