In Hadoop when do reduce tasks start? Do they start after a certain percentage (threshold) of mappers complete? If so, is this threshold fixed? What kind of threshold is typ
The reduce phase can start long before a reducer is called. As soon as "a" mapper finishes the job, the generated data undergoes some sorting and shuffling (which includes call to combiner and partitioner). The reducer "phase" kicks in the moment post mapper data processing is started. As these processing is done, you will see progress in reducers percentage. However, none of the reducers have been called in yet. Depending on number of processors available/used, nature of data and number of expected reducers, you may want to change the parameter as described by @Donald-miner above.