Parquet-backed Hive table: array column not queryable in Impala

流过昼夜 submitted on 2019-12-01 06:27:08

Question


Although Impala is much faster than Hive, we used Hive because it supports complex (nested) data types such as arrays and maps.

I noticed that Impala, as of CDH 5.5, now supports complex data types. Since it's also possible to run Hive UDFs in Impala, we can probably do everything we want in Impala, but much, much faster. That's great news!

As I scan through the documentation, I see that Impala expects data with complex types to be stored in Parquet format. My data, in its raw form, happens to be a two-column CSV where the first column is an ID, and the second column is a pipe-delimited array of strings, e.g.:

123,ASDFG|SDFGH|DFGHJ|FGHJK
234,QWERT|WERTY|ERTYU

A Hive table was created:

CREATE TABLE `id_member_of`(
  `id` INT, 
  `member_of` ARRAY<STRING>)
ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY ',' 
  COLLECTION ITEMS TERMINATED BY '|' 
  LINES TERMINATED BY '\n' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';

The raw data was loaded into the Hive table:

LOAD DATA LOCAL INPATH 'raw_data.csv' INTO TABLE id_member_of;

A Parquet version of the table was created:

CREATE TABLE `id_member_of_parquet` (
 `id` STRING, 
 `member_of` ARRAY<STRING>) 
STORED AS PARQUET;

The data from the CSV-backed table was inserted into the Parquet table:

INSERT INTO id_member_of_parquet SELECT id, member_of FROM id_member_of;

And the Parquet table is now queryable in Hive:

hive> select * from id_member_of_parquet;
123 ["ASDFG","SDFGH","DFGHJ","FGHJK"]
234 ["QWERT","WERTY","ERTYU"]

Strangely, when I query the same Parquet-backed table in Impala, it doesn't return the array column:

[hadoop01:21000] > invalidate metadata;
[hadoop01:21000] > select * from id_member_of_parquet;
+-----+
| id  |
+-----+
| 123 |
| 234 |
+-----+

Question: What happened to the array column? Can you see what I'm doing wrong?


Answer 1:


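It turned out to be really simple: in Impala, SELECT * skips complex-type columns, so the array does not show up in the result. We can access its elements by adding the array to the FROM clause as if it were a joined table, using dot notation, e.g.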

Query: select * from id_member_of_parquet, id_member_of_parquet.member_of
+-----+-------+
| id  | item  |
+-----+-------+
| 123 | ASDFG |
| 123 | SDFGH |
| 123 | DFGHJ |
| 123 | FGHJK |
| 234 | QWERT |
| 234 | WERTY |
| 234 | ERTYU |
+-----+-------+
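
The same join can be written with explicit aliases, which makes it easier to pick individual columns or filter on array elements. The following is a minimal sketch, not tested against the cluster above: the aliases `t` and `m` are illustrative names, while `item` and `pos` are the pseudocolumns Impala exposes for array elements (`item` appears in the output above; `pos` is the zero-based element index).

```sql
-- Project the id together with each array element and its position.
-- t and m are hypothetical aliases chosen for this example.
SELECT t.id, m.pos, m.item
FROM id_member_of_parquet t, t.member_of m;

-- Find the ids whose member_of array contains a given value.
SELECT t.id
FROM id_member_of_parquet t, t.member_of m
WHERE m.item = 'WERTY';
```

Because this is an inner join between each row and its own array, rows whose array is empty or NULL would not appear in the result.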


Source: https://stackoverflow.com/questions/37243714/parquet-backed-hive-table-array-column-not-queryable-in-impala
