Loading more records than actual in HIve

痞子三分冷 提交于 2019-12-08 08:57:50

问题


While inserting from Hive table to HIve table, It is loading more records that actual records. Can anyone help in this weird behaviour of Hive ?

My query would be looking like this:

insert overwrite table_a
    select col1,col2,col3,... from table_b;

My table_b consists of 6405465 records.

After inserting from table_b to table_a, i found total records in table_a are 6406565.

Can any one please help here ?


回答1:


If hive.compute.query.using.stats=true; then optimizer is using statistics for query calculation instead of querying table data. This is much faster because metastore is a fast database like MySQL and does not require map-reduce. But statistics can be not fresh (stale) if the table was loaded not using INSERT OVERWRITE or configuration parameter hive.stats.autogather responsible for statistics auto gathering was set to false. Also statistics will be not fresh after loading files or after using third-party tools. It's because files was never analyzed, statistics in metastore is not fresh, if you have put new files, nobody knows about how the data was changed. Also after sqoop loading, etc. So, it's a good practice to gather statistics for table or partition after loading using 'ANALYZE TABLE ... COMPUTE STATISTICS'.

In case it's impossible to gather statistics automatically (works for INSERT OVERWRITE) or by running ANALYZE statement then better to switch off hive.compute.query.using.stats parameter. Hive will query data instead of using statistics.

See this for reference: https://cwiki.apache.org/confluence/display/Hive/StatsDev#StatsDev-StatisticsinHive



来源:https://stackoverflow.com/questions/40959091/loading-more-records-than-actual-in-hive

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!