In Hive, which query is better and why?

北战南征 提交于 2020-08-10 04:23:48

问题


Assume there are two queries:

  1. select count(distinct a) from x;

  2. select count(*) from (select distinct a from x) y;

I know they return the same results, but from the perspective of Hive (using MapReduce). Can anyone please explain which one is the better choice and why?

Any help is appreciated.


回答1:


In Hive versions prior 1.2.0 the first query executes using one Map and one Reduce stages. Map sends each value to the single reducer, and reducer does all the job.

Single reducer processing too much data in this case.

During second query execution, mappers output distributed between many reducers, each reducer generates it's distinct list and final map-reduce job does summarize the size of each list.

Since Hive 1.2.0 Hive 1.2.0+ provides auto-rewrite optimization hive.optimize.distinct.rewrite=true/false, see HIVE-10568

See also Hortonworks community




回答2:


2nd Query: select count(*) from (select distinct a from x) y;

is upto 3.x faster than

1st Query: select count(distinct a) from x;

Please refer to https://issues.apache.org/jira/browse/HIVE-10568

Executed both the queries in Hive, first query executed in 1 stage with 1 reducer.

MapReduce Jobs Launched: Stage-Stage-1: Map: 3 Reduce: 1 Cumulative CPU: 46.51 sec HDFS Read: 42857 HDFS Write: 4 SUCCESS Total MapReduce CPU Time Spent: 46 seconds 510 msec

Second query executed in 2 stages, with improved parallelism.

MapReduce Jobs Launched: Stage-Stage-1: Map: 3 Reduce: 1 Cumulative CPU: 13.93 sec HDFS Read: 42857 HDFS Write: 115 SUCCESS Stage-Stage-2: Map: 1 Reduce: 1 Cumulative CPU: 5.83 sec HDFS Read: 510 HDFS Write: 4 SUCCESS Total MapReduce CPU Time Spent: 19 seconds 760 msec



来源:https://stackoverflow.com/questions/51489858/in-hive-which-query-is-better-and-why

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!