HiveSQL access JSON-array values

痞子三分冷 提交于 2020-01-06 09:12:25

问题


I have a table in Hive, which is generated by reading from a Sequence File in my HDFS. Those sequence files are json and look like this:

{"Activity":"Started","CustomerName":"CustomerName3","DeviceID":"StationRoboter","OrderID":"CustomerOrderID3","DateTime":"2018-11-27T12:56:47Z+0100","Color":[{"Name":"red","Amount":1},{"Name":"green","Amount":1},{"Name":"blue","Amount":1}],"BrickTotalAmount":3}

They submit product part colours and the amount of them which are counted in one service process run.

Please notice the json-array in color

Therefore my code to create the table is:

CREATE EXTERNAL TABLE iotdata(
  activity              STRING,
  customername          STRING,
  deviceid              STRING,
  orderid               STRING,
  datetime              STRING,
  color                 ARRAY<MAP<String,String>>,
  bricktotalamount      STRING
)
ROW FORMAT SERDE "org.apache.hive.hcatalog.data.JsonSerDe"
STORED AS
INPUTFORMAT 'org.apache.hadoop.mapred.SequenceFileInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat'
LOCATION '/IoTData/scray-data-000-v0';

This works, and if I do a select * on that table it looks like this:

But my problem is, that I have to access the data inside the color column for analysis. For example, I want to calc all red values in the table.

So this leads to several opportunities and questions: how can I cast the amount string which is created to an integer?

How can I access the data in my color-column via select?

Or is there a possibility to change my table schema right at the beginning to get 4 extra columns for my 4 colours and 4 extra columns for the related colour amounts?

I also tried to read in the whole json as string to one column, and select the subcontent there, but this importing json array into hive leads me only to NULL values, propably because my json file is not 100% well-formed.


回答1:


You can do this in two steps.

Create proper JSON table

CREATE external TABLE temp.test_json (
  activity string,
  bricktotalamount int,
  color array<struct<amount:int, name:string>>,
  customername string,
  datetime string,
  deviceid string,
  orderid string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
location '/tmp/test_json/table'

Explode the Table in Select Statement

select activity, bricktotalamount, customername, datetime, deviceid, orderid, name, amount from temp.test_json
lateral view inline(color) c as amount,name




回答2:


The data inside of your array is definitely not a map for hive, you need to specify. I would recommend redefine your table specifying the structure of the array's data like this

CREATE EXTERNAL TABLE iotdata(
  activity              STRING,
  customername          STRING,
  deviceid              STRING,
  orderid               STRING,
  datetime              STRING,
  color ARRAY<STRUCT<NAME: STRING,AMOUNT:BIGINT>>
  bricktotalamount      STRING
)
ROW FORMAT SERDE "org.apache.hive.hcatalog.data.JsonSerDe"
STORED AS
INPUTFORMAT 'org.apache.hadoop.mapred.SequenceFileInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat'
LOCATION '/IoTData/scray-data-000-v0';

in that way you should be able to the structure it self



来源:https://stackoverflow.com/questions/53670500/hivesql-access-json-array-values

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!