How to fix “Task attempt_201104251139_0295_r_000006_0 failed to report status for 600 seconds.”

Question

I wrote a MapReduce job to extract some information from a dataset of users' movie ratings. There are about 250K users and about 300K movies. The map output is <user, <movie, rating>*> and <movie, <user, rating>*>. In the reducer, I process these pairs.

When I run the job, the mappers complete as expected, but the reducers always complain:

Task attempt_* failed to report status for 600 seconds.

I know this is caused by the task failing to update its status, so I added a call to context.progress() in my code like this:

int count = 0;
while (values.hasNext()) {
  if (count++ % 100 == 0) {
    context.progress();
  }
  /*other code here*/
}

Unfortunately, this does not help. Many reduce tasks still fail.

Here is the log:

Task attempt_201104251139_0295_r_000014_1 failed to report status for 600 seconds. Killing!
11/05/03 10:09:09 INFO mapred.JobClient: Task Id : attempt_201104251139_0295_r_000012_1, Status : FAILED
Task attempt_201104251139_0295_r_000012_1 failed to report status for 600 seconds. Killing!
11/05/03 10:09:09 INFO mapred.JobClient: Task Id : attempt_201104251139_0295_r_000006_1, Status : FAILED
Task attempt_201104251139_0295_r_000006_1 failed to report status for 600 seconds. Killing!

BTW, the error happens in the reduce copy phase. The log says:

reduce > copy (28 of 31 at 26.69 MB/s) > :Lost task tracker: tracker_hadoop-56:localhost/127.0.0.1:34385

Thanks for the help.

Answer 1:

The easiest way is to set this configuration parameter:

<property>
  <name>mapred.task.timeout</name>
  <value>1800000</value> <!-- 30 minutes -->
</property>

in mapred-site.xml

Answer 2:

Another easy way is to set it in the job configuration inside your program:

 Configuration conf = new Configuration();
 long milliSeconds = 1000 * 60 * 60; // 1 hour; the default is 600000, and any value can be given
 conf.setLong("mapred.task.timeout", milliSeconds);

Before setting it, check the job file (job.xml) in the JobTracker GUI to confirm whether the correct property name is mapred.task.timeout or mapreduce.task.timeout. While the job is running, check the job file again to confirm that the property has changed to the value you set.
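
If you are not sure which name your release reads, one option is to prepare the value under both names. The sketch below is only an illustration: the class TaskTimeoutSettings and its method are hypothetical names, and it assumes that setting a name your release does not use is harmless. It also uses TimeUnit for the conversion, because a literal product such as 1000*60*60 is evaluated as an int and would overflow for timeouts longer than about 24 days.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: builds the task timeout setting under both property names. */
final class TaskTimeoutSettings {
    static final String OLD_KEY = "mapred.task.timeout";    // name used by older releases
    static final String NEW_KEY = "mapreduce.task.timeout"; // name used by newer releases

    /** Returns both property names mapped to the timeout in milliseconds. */
    static Map<String, String> forMinutes(long minutes) {
        // TimeUnit does the conversion in long arithmetic, so no int overflow.
        String millis = Long.toString(TimeUnit.MINUTES.toMillis(minutes));
        Map<String, String> settings = new LinkedHashMap<>();
        settings.put(OLD_KEY, millis);
        settings.put(NEW_KEY, millis);
        return settings;
    }
}
```

With a Hadoop Configuration object named conf, the entries could then be applied with TaskTimeoutSettings.forMinutes(30).forEach(conf::set); checking job.xml afterwards, as described above, still tells you which name actually took effect.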

Answer 3:

In newer versions, the parameter has been renamed to mapreduce.task.timeout, as described in this link (search for task.timeout). You can also disable the timeout, as described in the same link:

The number of milliseconds before a task will be terminated if it neither reads an input, writes an output, nor updates its status string. A value of 0 disables the timeout.

Below is an example setting in mapred-site.xml:

<property>
  <name>mapreduce.task.timeout</name>
  <value>0</value> <!-- A value of 0 disables the timeout -->
</property>
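
Keep in mind that with the timeout disabled, a task that is genuinely hung is never killed. If the slow part is in your own map or reduce code, an alternative is to keep the timeout and report progress on a timer instead of every N records, so that a single long-running record cannot silence the task. The following is a rough sketch only: ProgressHeartbeat is a hypothetical helper, not part of Hadoop, and it assumes that calling the progress callback from a second thread is acceptable in your version. It cannot help when the stall happens in the copy phase, as in the question, because user code is not running at that point.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: invokes a progress callback at a fixed interval until closed. */
final class ProgressHeartbeat implements AutoCloseable {
    private final ScheduledExecutorService scheduler;

    ProgressHeartbeat(Runnable reportProgress, long interval, TimeUnit unit) {
        scheduler = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "progress-heartbeat");
            t.setDaemon(true); // do not keep the JVM alive just for heartbeats
            return t;
        });
        // First call after one interval, then repeatedly at the same rate.
        scheduler.scheduleAtFixedRate(reportProgress, interval, interval, unit);
    }

    @Override
    public void close() {
        scheduler.shutdownNow(); // stop reporting once the slow work is done
    }
}
```

Inside a reducer, this might be used as try (ProgressHeartbeat hb = new ProgressHeartbeat(context::progress, 60, TimeUnit.SECONDS)) { /* slow work */ }. Because the heartbeat keeps running even if the work itself is stuck, it hides real hangs in the same way a disabled timeout does, so it is best wrapped only around work that is known to be slow.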

Answer 4:

If you have a Hive query that is timing out, you can set the above configurations in the following way:

set mapred.tasktracker.expiry.interval=1800000;

set mapred.task.timeout=1800000;

Answer 5:

From https://issues.apache.org/jira/browse/HADOOP-1763

The cause might be the following sequence of events:

1. The TaskTrackers run the maps successfully.
2. Map outputs are served by Jetty servers on the TaskTrackers (TTs).
3. All the reduce tasks connect to all the TTs where maps ran.
4. Since many reduces want to connect to the map output server, the Jetty servers run out of threads (default 40).
5. The TaskTrackers continue to send periodic heartbeats to the JobTracker (JT), so they are not considered dead, but their Jetty servers are (temporarily) down.

Source: https://stackoverflow.com/questions/5864589/how-to-fix-task-attempt-201104251139-0295-r-000006-0-failed-to-report-status-fo

Tags

Hadoop

MapReduce