Querying optional nested JSON fields in Athena

Question

I have JSON data that looks something like:

    { "col1" : 123, "metadata" : { "opt1" : 456, "opt2" : 789 } }

where the various metadata fields (of which there are many) are optional and may or may not be present.

My query is:

    select col1, metadata.opt1 from "db-name".tablename

If opt1 is not present in any rows, I would expect this to return all rows with a blank in the opt1 column. However, if no row had opt1 in metadata when the crawler ran (and it might still be absent from the data when the query is run, as it's optional), the query fails with:

    SYNTAX_ERROR: line 2:1: Column '"metadata"."opt1"' cannot be resolved

I could specify these fields manually in the schema definition (if I don't use a crawler), but then it wouldn't pick up any new metadata fields that may arrive, and specifying a static schema doesn't seem to be in the spirit of how Athena is supposed to work.

How do I get this to function as expected (preferably without putting dummy rows in or customizing the SerDe)?

Using SerDe org.openx.data.jsonserde.JsonSerDe at present.

Thanks for any ideas.

Answer 1:

It might not be what you want to hear, but I advise you not to use Glue Crawler. This is just the tip of the iceberg of the problems it creates when your use case doesn't fit exactly with the use cases it was designed for (several other questions on this site describe similar problems).

Instead, create the table manually using whatever Glue Crawler created for you when it worked (you can get the DDL for a table with SHOW CREATE TABLE foo in Athena). Then add partitions manually with ALTER TABLE foo ADD PARTITION.
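The two statements mentioned above might be used roughly as follows. This is only a sketch: the partition key dt, its value, and the S3 location are hypothetical placeholders, and it assumes the table's database is already selected in the query editor.

```sql
-- Print the DDL that the crawler generated, so it can be saved and
-- used to recreate the table by hand.
SHOW CREATE TABLE tablename;

-- Register one partition manually. The partition key (dt), its value,
-- and the S3 path are placeholders for illustration.
ALTER TABLE tablename ADD IF NOT EXISTS
  PARTITION (dt = '2020-04-19')
  LOCATION 's3://example-bucket/example-prefix/dt=2020-04-19/';
```

IF NOT EXISTS makes the statement safe to re-run if the partition has already been added.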

Keeping the table up to date with optional fields is going to be complicated, whatever method you use. If you only ever add fields, you can update the table's columns when you add a new partition that has more columns (if you do it with Athena, do it before you add the partition). Another way would be to simply type the metadata column as STRING and use JSON functions to extract the properties in your queries.
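The STRING approach could look something like the sketch below. It assumes the SerDe hands back the raw JSON text of the nested object when the column is declared as string, which is worth verifying against your own data; the S3 location is a placeholder, and the table is assumed to live in the currently selected database.

```sql
-- Declare metadata as a plain string instead of a struct, so the
-- schema no longer depends on which optional fields exist.
CREATE EXTERNAL TABLE tablename (
  col1 int,
  metadata string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://example-bucket/example-prefix/';

-- Pull optional fields out at query time. json_extract_scalar returns
-- NULL when the path is missing, so rows without opt1 are still returned.
SELECT
  col1,
  CAST(json_extract_scalar(metadata, '$.opt1') AS integer) AS opt1
FROM "db-name".tablename;
```

The trade-off is that every query has to name the JSON path and cast the result itself, since json_extract_scalar always returns a varchar.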

I assume you're using Glue Crawler to add partitions periodically. If you're in control of the process that adds data to S3, I suggest you add code there that runs an ALTER TABLE … ADD PARTITION (or uses CreatePartition in the Glue API).

If you're not in control of that code, or it would be very inconvenient, you can solve the problem with Lambda. If you, for example, only partition by time, you can run it once per day and add the next day's partition (there doesn't have to be any data on S3; you can add partitions that don't yet contain data, since a partition is just metadata). If it's more complex than that, you can trigger the Lambda function to run when new files are created on S3 and add the partitions as a reaction.
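For the once-per-day case, the statement that a scheduled function submits to Athena might look like the sketch below. The partition key dt, the dates, and the S3 paths are hypothetical, and the prefixes do not need to contain any objects yet.

```sql
-- Add the upcoming partitions ahead of time. Several PARTITION clauses
-- can go in one statement, and IF NOT EXISTS keeps a repeated run
-- from failing on partitions that were already registered.
ALTER TABLE tablename ADD IF NOT EXISTS
  PARTITION (dt = '2020-04-20')
  LOCATION 's3://example-bucket/example-prefix/dt=2020-04-20/'
  PARTITION (dt = '2020-04-21')
  LOCATION 's3://example-bucket/example-prefix/dt=2020-04-21/';
```

Adding a day or two beyond the next one gives some slack if a scheduled run is missed.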

This might sound more complicated than using Glue Crawler, and if Glue Crawlers actually worked as you expect them to, it would be. Since they don't really work very well, it's going to be a lot less work.

Answer 2:

You can use try as a workaround for your problem.

    select col1, try(metadata.opt1) from "db-name".tablename

Source: https://stackoverflow.com/questions/61297671/querying-optional-nested-json-fields-in-athena

Tags

json

amazon-web-services

aws-glue

amazon-athena