Reducers for Hive data


Question


I'm a novice. I'm curious to know how the number of reducers is set for different Hive data sets. Is it based on the size of the data processed? Or is there a default number of reducers for all jobs?

For example, how many reducers does 5 GB of data require? Will the same number of reducers be set for a smaller data set?

Thanks in advance!! Cheers!


Answer 1:


In open-source Hive (and likely EMR):

# reducers = (# bytes of input to mappers)
             / (hive.exec.reducers.bytes.per.reducer)

The default for hive.exec.reducers.bytes.per.reducer is 1 GB.
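Applying that formula to the 5 GB example from the question (a back-of-the-envelope sketch, assuming the 1 GB default and no cap from hive.exec.reducers.max):

# reducers = ceil(5 GB / 1 GB per reducer) = 5
# a 500 MB input under the same settings rounds up to 1 reducer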

The number of reducers therefore also depends on the size of the input. You can change this behavior by setting the property hive.exec.reducers.bytes.per.reducer:

either by changing hive-site.xml

<property>
  <name>hive.exec.reducers.bytes.per.reducer</name>
  <value>1000000</value>
</property>

or using set

hive -e "set hive.exec.reducers.bytes.per.reducer=100000;"
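In a full session, the property is typically set right before the query it should affect (a minimal sketch; the table docs and its column word are hypothetical):

set hive.exec.reducers.bytes.per.reducer=100000;
SELECT word, COUNT(*) AS cnt
FROM docs
GROUP BY word;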




Answer 2:


In a MapReduce program, work is assigned to reducers based on the keys in the reducer input. Hence the reduce method is called once for each <key, (list of values)> pair in the grouped inputs. It does not depend on the data size.

Suppose you are running a simple word-count program on a 1 MB file, but the mapper output contains 5 keys going to the reduce phase; then there is a chance that 5 reducers will perform that task.

But suppose you have 5 GB of data and the mapper output contains only one key; then only one reducer will be assigned to process the data in the reduce phase.
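The same effect is visible in Hive queries (an illustrative sketch; big_table and word are hypothetical names):

-- a global aggregate has a single key, so the final
-- aggregation runs in one reducer regardless of input size
SELECT COUNT(*) FROM big_table;

-- a GROUP BY produces one group per distinct key, so the
-- work can spread across many reducers
SELECT word, COUNT(*) FROM big_table GROUP BY word;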

The number of reducers in Hive is also controlled by the following configuration properties (a combined sketch follows the list):

mapred.reduce.tasks
Default Value: -1

The default number of reduce tasks per job. Typically set to a prime close to the number of available hosts. Ignored when mapred.job.tracker is "local". Hadoop sets this to 1 by default, whereas Hive uses -1 as its default value. By setting this property to -1, Hive will automatically figure out the number of reducers.

hive.exec.reducers.bytes.per.reducer
Default Value: 1000000000

The default is 1 GB; i.e., if the input size is 10 GB, Hive will use 10 reducers.

hive.exec.reducers.max
Default Value: 999

The maximum number of reducers that will be used. If mapred.reduce.tasks is negative, Hive will use this value as the maximum number of reducers when automatically determining the reducer count.
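Putting the three properties together, the commonly described estimation logic looks roughly like this (a sketch, not a quote from the Hive source):

# assuming mapred.reduce.tasks = -1 (Hive's default):
# reducers = min(ceil(input bytes / hive.exec.reducers.bytes.per.reducer),
#                hive.exec.reducers.max)

To pin the count yourself for a session instead of letting Hive estimate it:

set mapred.reduce.tasks=10;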

How Many Reduces?

The right number of reduces seems to be 0.95 or 1.75 multiplied by (<no. of nodes> * mapred.tasktracker.reduce.tasks.maximum). With 0.95, all of the reduces can launch immediately and start transferring map outputs as the maps finish. With 1.75, the faster nodes will finish their first round of reduces and launch a second wave of reduces, doing a much better job of load balancing.
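A quick worked example (hypothetical cluster: 10 nodes, mapred.tasktracker.reduce.tasks.maximum = 2):

# 0.95 * (10 * 2) = 19 reducers  (one wave, all launch immediately)
# 1.75 * (10 * 2) = 35 reducers  (two waves, better load balancing)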

Increasing the number of reduces increases the framework overhead, but improves load balancing and lowers the cost of failures. The scaling factors above are slightly less than whole numbers to reserve a few reduce slots in the framework for speculative tasks and failed tasks.

Source: http://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html

Please check the link below for more clarification about reducers.

Hadoop MapReduce: Clarification on number of reducers




Answer 3:


hive.exec.reducers.bytes.per.reducer

Default Value: 1,000,000,000 prior to Hive 0.14.0; 256 MB (256,000,000) in Hive 0.14.0 and later

Source: https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties
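To check which default your installation actually uses, running set with just the property name prints its current value from a Hive session (a minimal sketch):

set hive.exec.reducers.bytes.per.reducer;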



Source: https://stackoverflow.com/questions/30368437/reducers-for-hive-data
