Why is the final reduce step extremely slow in this MapReduce? (HiveQL, HDFS MapReduce)

风流意气都作罢 submitted on 2019-11-27 05:22:31

If the final reducer is a join, this looks like skew in the join key. First, check two things:

check that b.f1 join key has no duplicates:

select b.f1, count(*) cnt from B b 
 group by b.f1 
having count(*)>1 order by cnt desc;

check the distribution of a.f1:

select a.f1, count(*) cnt from A a
 group by a.f1  
order by cnt desc
limit 10;

This query shows the skewed keys: they are the ones at the top of the list with disproportionately large counts.
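
To judge how severe the skew is, it can help to see each key's share of the total row count rather than only the raw count. The following is a minimal sketch, assuming a Hive version with windowing functions (0.11 or later) and the same table A and column f1 as above; the aliases t and share are illustrative only.

```sql
-- Top 10 join keys in A with their fraction of all rows.
-- A NULL key often shows up here as the heaviest "value".
select t.f1, t.cnt, t.cnt / sum(t.cnt) over () as share
from (select a.f1, count(*) cnt from A a group by a.f1) t
order by t.cnt desc
limit 10;
```

A key that holds a large fraction of the rows on its own will send all of those rows to a single reducer, which matches the symptom of one reducer running far longer than the rest.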

If there is skew (too many rows with the same key value), join the skewed keys separately and combine the results with UNION ALL:

SELECT a.f1, f2, ..., fn
  FROM ( select * from A where f1 = skewed_value) as a --skewed
  LEFT JOIN B as b
  ON a.f1 = b.f1
WHERE {PARTITION_FILTER}
UNION ALL
SELECT a.f1, f2, ..., fn
  FROM ( select * from A where f1 != skewed_value or f1 is null) as a --all other keys; NULL kept explicitly because != filters it out
  LEFT JOIN B as b
  ON a.f1 = b.f1
WHERE {PARTITION_FILTER}

Finally, if there are no issues with skew or duplicates, try increasing reducer parallelism. First, get the current bytes-per-reducer setting:

set hive.exec.reducers.bytes.per.reducer;

Typically this returns a value of about 1 GB (the exact default depends on the Hive version). Try halving it, or reducing it further as in the example below: set the new value before your query, then check how many reducers start and how the query performs. The change is a success if more reducers start and performance improves.

set hive.exec.reducers.bytes.per.reducer=67108864;

The fewer bytes per reducer, the more reducers are started, which increases parallelism.
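
As a rough worked example of that relationship: Hive estimates the reducer count as approximately the reduce-stage input size divided by bytes per reducer, capped by hive.exec.reducers.max. The 10 GiB input size below is an assumed figure for illustration, not something taken from the question, and the actual count Hive picks may differ.

```sql
-- Assumed input to the reduce stage: 10 GiB = 10737418240 bytes.
-- Estimated reducers ~= min(hive.exec.reducers.max, ceil(input bytes / bytes per reducer))
--   at 1073741824 bytes (1 GiB) per reducer:  10737418240 / 1073741824 = 10 reducers
--   at   67108864 bytes (64 MiB) per reducer: 10737418240 / 67108864   = 160 reducers
set hive.exec.reducers.max;                         -- print the upper cap on the estimate
set hive.exec.reducers.bytes.per.reducer=67108864;  -- 64 MiB
-- To bypass the estimate, the reducer count can also be fixed explicitly:
-- set mapreduce.job.reduces=160;
```

Note that more reducers only help when the rows are spread across many keys; all rows for one key still go to a single reducer, so this does not fix skew.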

UPDATE: Try enabling map join. Your second table is small enough to fit in memory, so the map join will run without reducers at all and skew on the reducers will not be a problem.

How to enable mapjoin: https://stackoverflow.com/a/49154414/2700344
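
In short, a minimal sketch of the session settings involved, assuming automatic join conversion is available in your Hive version. The threshold value shown is only an example and should be sized to your table B and the memory available to the map tasks.

```sql
-- Let Hive convert a common (reduce-side) join into a map join
-- when the small table is below the size threshold.
set hive.auto.convert.join=true;
-- Small-table size threshold in bytes (example value, about 25 MB).
set hive.mapjoin.smalltable.filesize=25000000;
-- Then run the original join query in the same session.
```

If the conversion takes effect, the query plan from EXPLAIN should show a map join operator instead of a reduce-side join.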
