AWS Glue Custom Classifiers JSON Path

Posted by 时光怂恿深爱的人放手 on 2019-12-01 19:28:09

This is an issue related to Hive. I suggest two approaches. First, you can create a new table in Athena with a struct data type, like this:

CREATE EXTERNAL TABLE `example`(
`row` struct<client:string,filename:string,file_row_number:int,secondary_db_index:string,processed_timestamp:int,processed_datetime:string,entity_id:string,entity_name:string,is_emailable:boolean,is_txtable:boolean,is_loadable:boolean> COMMENT 'from deserializer')
ROW FORMAT SERDE 
'org.openx.data.jsonserde.JsonSerDe' 
STORED AS INPUTFORMAT 
'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://example'
TBLPROPERTIES (
'CrawlerSchemaDeserializerVersion'='1.0', 
'CrawlerSchemaSerializerVersion'='1.0', 
'UPDATED_BY_CRAWLER'='example', 
'averageRecordSize'='271', 
'classification'='json', 
'compressionType'='none', 
'jsonPath'='$[*]', 
'objectCount'='1', 
'recordCount'='1', 
'sizeKey'='271', 
'transient_lastDdlTime'='1535533583', 
'typeOfData'='file')

And then you can run the query as follows:

SELECT row.client, row.filename, row.file_row_number FROM "example"

Second, you can restructure your JSON file as shown below and then run the crawler again. This example uses the single-JSON-record-per-line format: each line holds one complete JSON object, and the file has no enclosing array.

{"client":"toys","filename":"toy1.csv","file_row_number":1,"secondary_db_index":"4050","processed_timestamp":1535004075,"processed_datetime":"2018-08-23T06:01:15+0000","entity_id":"4050","entity_name":"4050","is_emailable":false,"is_txtable":false,"is_loadable":false}
{"client":"toys2","filename":"toy2.csv","file_row_number":1,"secondary_db_index":"4050","processed_timestamp":1535004075,"processed_datetime":"2018-08-23T06:01:15+0000","entity_id":"4050","entity_name":"4050","is_emailable":false,"is_txtable":false,"is_loadable":false}
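
If the existing file is a single top-level JSON array (which the 'jsonPath'='$[*]' table property suggests, though that is an assumption about your data), a conversion to the one-record-per-line layout could look roughly like the sketch below. The function name json_array_to_lines is hypothetical, the whole file is assumed to fit in memory, and uploading the result back to S3 is left out.

```python
import json


def json_array_to_lines(array_text: str) -> str:
    """Turn a document holding a top-level JSON array into one-record-per-line text.

    Hypothetical helper for illustration; assumes the input fits in memory.
    """
    records = json.loads(array_text)
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array")
    # Compact separators keep every record on a single physical line;
    # records are separated only by newlines, never by commas.
    return "".join(json.dumps(r, separators=(",", ":")) + "\n" for r in records)


if __name__ == "__main__":
    sample = '[{"client": "toys", "file_row_number": 1}, {"client": "toys2", "file_row_number": 1}]'
    print(json_array_to_lines(sample), end="")
```

After writing the converted file to the table's S3 location, run the crawler again so the columns are picked up at the top level rather than inside a struct.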