When the underlying data changes, do we need to drop and create the partition in Hive?

余生长醉 提交于 2020-01-06 04:46:46

问题


Let's say I have a hive table partitioned by date with its data stored in S3 as Parquet files. Let's also assume that for a particular partition (date), there were originally 20 records.

If I then delete the original files and put new Parquet files with 50 records in the same folder, do I need to drop and recreate that partition for the new data to reflect?

My understanding was that we don't have to recreate partitions. So I tried removing old data from the respective folder and keeping the new data without "updating" the Hive partition. However, then when I took count(*) for that date, it still showed as 20 records instead of 50. Upon dropping and creating the partition again, it started showing the correct count. Is that the expected behavior?


回答1:


Hive optimizes simple queries like select count(*) using statistics. If this property is set:

set hive.compute.query.using.stats=true;

Then Hive will take count from statistics stored in the metadata.

When you replaced file with new one, statistics remains the same. When you deleted partition, all related statistics was also deleted, this is why you got correct count after re-creating partition.

See also this answer: HIVE select count() non null returns higher value than select count() - Predicate prevents statistics usage in this case.

This behavior is quite expected. You can either

set hive.compute.query.using.stats=false;

To switch-off statistics usage for query result calculation, your partition recreation effectively does the same because it removed statistics, this is why statistics was not used and file was scanned.

Or you can analyze table to have statistics updated and keep above parameter set true, so next time you will perform simple aggregation, it will work fast:

ANALYZE TABLE tablename [PARTITION(partcol1[=val1], partcol2[=val2], ...)]  
  COMPUTE STATISTICS

For small file with 50 records only performance difference is not so big. But better to have statistics updated, it also being used by optimizer to build optimal query plan.

More details here: analyze table

And if you inserting data using INSERT OVERWRITE, you can enable statistics auto-gathering:

set hive.stats.autogather=true;


来源:https://stackoverflow.com/questions/56314296/when-the-underlying-data-changes-do-we-need-to-drop-and-create-the-partition-in

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!