Multi-line JSON file querying in Hive

Submitted by 余生颓废 on 2020-01-15 04:14:48

Question


I understand that the majority of JSON SerDe formats expect .json files to be stored with one record per line.

I have an S3 bucket with multi-line indented .json files (don't control the source) that I'd like to query using Amazon Athena (though I suppose this applies just as well to Hive generally).

  1. Is there a SerDe format out there that is able to parse multi-line indented .json files?
  2. If there isn't a SerDe format to do this:
    • Is there a best practice for dealing with files like this?
      • Should I plan on flattening these records out using a different tool like Python?
    • Is there a standard way of writing custom SerDe formats, so I can write one myself?

Example file body:

[
  {
    "id": 1,
    "name": "ryan",
    "stuff": {
      "x": true,
      "y": [
        123,
        456
      ]
    }
  },
  ...
]

Answer 1:


Unfortunately, no SerDe supports multi-line JSON content. The specialized CloudTrail SerDe handles a format similar to yours, but it is hard-coded for the CloudTrail JSON format. It does at least show that parsing such files is possible in principle. Currently there is no way to write your own SerDes for use with Athena, though.

You won't be able to consume these files with Athena as they are. You will have to use EMR, Glue, or some other tool to reformat them into JSON stream files (one record per line) first.
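
If you go the reformatting route with plain Python, a minimal sketch could look like the one below. It assumes each file holds a single top-level JSON array (as in the example file body) and is small enough to load into memory in one go. The function name `array_to_json_lines` and the file names in the usage comment are made up for illustration; reading from and writing back to S3 is left out.

```python
import io
import json
from typing import TextIO


def array_to_json_lines(src: TextIO, dst: TextIO) -> int:
    """Read one JSON array from src and write one compact record per line to dst.

    Hypothetical helper: assumes the whole document fits in memory and that
    the top-level value is an array of records.
    """
    records = json.load(src)
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array")

    count = 0
    for record in records:
        # Compact separators keep each record on a single line.
        dst.write(json.dumps(record, separators=(",", ":"), ensure_ascii=False))
        dst.write("\n")
        count += 1
    return count


if __name__ == "__main__":
    # In-memory demo; for real files, open them instead, e.g.
    # with open("input.json") as src, open("output.json", "w") as dst: ...
    source = io.StringIO('[\n  {"id": 1, "name": "ryan"},\n  {"id": 2, "name": "sam"}\n]')
    target = io.StringIO()
    array_to_json_lines(source, target)
    print(target.getvalue(), end="")
```

The output files then have one record per line, which is the layout the JSON SerDes expect.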



Source: https://stackoverflow.com/questions/54466526/multi-line-json-file-querying-in-hive
