Is there a way to identify or detect data skew in a Hive table?

Submitted by 烈酒焚心 on 2019-12-01 12:28:10

Question


We have many Hive queries that take a long time to run. We are using Tez and other good practices such as CBO and ORC files.

Is there a way to check or analyze data skew, for example with a command? Would an explain plan help, and if so, which parameter should I look for?


Answer 1:


An explain plan will not help here; you need to check the data itself. For a join, select the top 100 join key values by row count from every table involved in the join. For an analytic function, do the same for the PARTITION BY key. If a few key values account for a disproportionate share of the rows, the data is skewed.

Example:

-- "table" and "key" are placeholders for your table name and join key column(s)
select key, count(*) cnt
  from table
 group by key
having count(*) > 1000 -- also check > 1 for tables that should not contain duplicates (such as dimensions)
 order by cnt desc
 limit 100;

Here key can be a composite join key, that is, all the columns you use in the join's ON condition.
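To make the composite-key case concrete, here is a minimal sketch that runs the same group-by check over two key columns and also reports each key's share of the total rows. It is only an illustration: SQLite (via Python's standard library) stands in for Hive so the example runs locally, and the `orders` table, its columns, and the `top_join_keys` helper are hypothetical names that do not come from the original answer.

```python
import sqlite3

def top_join_keys(conn, table, key_cols, limit=100):
    """Return (key columns..., cnt, share) for the most frequent key combinations."""
    cols = ", ".join(key_cols)
    total = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    if total == 0:
        return []
    rows = conn.execute(
        f"SELECT {cols}, COUNT(*) AS cnt FROM {table} "
        f"GROUP BY {cols} "
        f"ORDER BY cnt DESC, {cols} LIMIT ?",  # key columns break ties deterministically
        (limit,),
    ).fetchall()
    # Append each key's fraction of all rows; a large fraction signals skew.
    return [row + (row[-1] / total,) for row in rows]

# Hypothetical toy data: one (customer_id, region) pair dominates the table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INTEGER, region TEXT)")
sample = [(1, "EU")] * 90 + [(2, "EU")] * 5 + [(3, "US")] * 5
conn.executemany("INSERT INTO orders VALUES (?, ?)", sample)

report = top_join_keys(conn, "orders", ["customer_id", "region"], limit=3)
for customer_id, region, cnt, share in report:
    print(customer_id, region, cnt, f"{share:.0%}")
```

In Hive, the same query shape (group by all join columns, order by count descending, limit 100) would be run against each table in the join; the share column is just a convenience for judging how dominant the top keys are.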

Also have a look at this answer: https://stackoverflow.com/a/51061613/2700344



Source: https://stackoverflow.com/questions/53332761/is-there-a-way-to-identify-or-detect-data-skew-in-hive-table
