Load complex JSON in Hive using JsonSerde

Submitted by *爱你&永不变心* on 2019-12-08 14:10:40

Question


I am trying to build a table in Hive for the following JSON:

{
    "business_id": "vcNAWiLM4dR7D2nwwJ7nCA",
    "hours": {
        "Tuesday": {
            "close": "17:00",
            "open": "08:00"
        },
        "Friday": {
            "close": "17:00",
            "open": "08:00"
        }
    },
    "open": true,
    "categories": [
        "Doctors",
        "Health & Medical"
    ],
    "review_count": 9,
    "name": "Eric Goldberg, MD",
    "neighborhoods": [],
    "attributes": {
        "By Appointment Only": true,
        "Accepts Credit Cards": true, 
        "Good For Groups": 1
    },
    "type": "business"
}

I can create a table using the following DDL; however, I get an exception while querying that table.

CREATE TABLE IF NOT EXISTS business (
 business_id string,
 hours map<string,string>,
 open boolean,
 categories array<string>,
 review_count int,
 name string,
 neighborhoods array<string>,
 attributes map<string,string>,
 type string
 )
 ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde';

The exception while retrieving data is "ClassCast: Can't cast jsonarray to json object". What is the correct schema for this JSON? Is there any tool that can help me generate the correct schema for a given JSON document, to be used with JsonSerde?


Answer 1:


It looks to me like the problem is hours, which you defined as hours map<string,string> but which should be a map<string,map<string,string>> instead, because each day maps to a nested object with its own open and close keys.

There's a tool you can use to generate the hive table definition automatically from your JSON data: https://github.com/quux00/hive-json-schema

but you may want to adjust its output, because when it encounters a JSON object (anything between {}) the tool can't know whether to translate it to a Hive map or to a struct. On your data, the tool gives me this:

CREATE TABLE x (
 attributes struct<accepts credit cards:boolean, 
       by appointment only:boolean, good for groups:int>,
 business_id string,
 categories array<string>,
 hours map<string,struct<close:string, open:string>>,
 name string,
 neighborhoods array<string>,
 open boolean,
 review_count int,
 type string
)

but it looks like you want something like this:

CREATE TABLE x (
     attributes map<string,string>,
     business_id string,
     categories array<string>,
     hours map<string,struct<close:string, open:string>>,
     name string,
     neighborhoods array<string>,
     open boolean,
     review_count int,
     type string
    ) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE;

hive> load data local inpath 'json.data'  overwrite into  table x;
hive> Table default.x stats: [numFiles=1, numRows=0, totalSize=416,rawDataSize=0]
OK
hive> select * from x;
OK
{"accepts credit cards":"true","by appointment only":"true",
  "good for groups":"1"}    
  vcNAWiLM4dR7D2nwwJ7nCA    
  ["Doctors","Health & Medical"]    
  {"tuesday":{"close":"17:00","open":"08:00"},
   "friday":{"close":"17:00","open":"08:00"}}   
    Eric Goldberg, MD   ["HELLO"]   true    9   business
Time taken: 0.335 seconds, Fetched: 1 row(s)
hive>
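
As a rough sketch of how the nested columns of table x can then be queried (the aliases tuesday_open, tuesday_close, day_name and day_hours are made up for illustration, and the lowercase keys are an assumption based on the output shown above):

```sql
-- Map values are looked up with [] and struct fields with dot notation.
-- The keys are lowercase in the SELECT * output above, hence 'tuesday'.
SELECT name,
       hours['tuesday'].open  AS tuesday_open,
       hours['tuesday'].close AS tuesday_close
FROM x;

-- Flatten the hours map into one row per day.
-- explode() on a map yields two columns: the key and the value.
SELECT name,
       day_name,
       day_hours.open  AS open_time,
       day_hours.close AS close_time
FROM x
LATERAL VIEW explode(hours) hrs AS day_name, day_hours;

-- attributes is declared as map<string,string>, so values compare as strings.
SELECT name
FROM x
WHERE attributes['accepts credit cards'] = 'true';
```

A lookup on a key that is missing from the map (for example hours['monday'] in this record) returns NULL rather than raising an error.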

A few notes though:

  • Notice that I used a different JSON SerDe (org.openx.data.jsonserde.JsonSerDe, as in the DDL above) because I don't have the one you used on my system. I like this one better because, well, I wrote it. But the create statement should work just as well with the other SerDe.
  • You may want to convert some of those maps to structs, as they may be more convenient to query. For instance, attributes could be a struct, but you'd need to map the names that contain a space, like accepts credit cards. My SerDe lets you map a JSON attribute to a different Hive column name. That is also needed when the JSON uses an attribute name that is a Hive keyword, like 'timestamp' or 'create'.
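
To illustrate the keyword case from the last note, here is a minimal sketch of how such a mapping is declared with the org.openx.data.jsonserde.JsonSerDe SerDe, to the best of my knowledge of its "mapping." SerDe property. The table name events and the columns ts and payload are hypothetical and do not come from the question's data:

```sql
-- Hypothetical table: the JSON records carry a "timestamp" attribute,
-- which clashes with the Hive keyword, so the column is named ts instead.
CREATE TABLE events (
  ts string,
  payload string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
-- "mapping.<hive column>" = "<json attribute>":
-- read the JSON attribute "timestamp" into the Hive column ts.
WITH SERDEPROPERTIES ("mapping.ts" = "timestamp")
STORED AS TEXTFILE;
```

Whether the same property covers names with spaces inside a struct, such as accepts credit cards, is worth checking against the SerDe's own documentation before relying on it.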


Source: https://stackoverflow.com/questions/33927116/load-complex-json-in-hive-using-jsonserde
