Question
I am writing a Hadoop scheduler. My scheduling requires finding the CPU time taken by each Map/Reduce task.
I know that:
The TaskInProgress class maintains the execStartTime and execFinishTime values which are wall-clock times when the process started and finished, but they do not accurately indicate the CPU time consumed by the task.
Each task is executed in a new JVM, and I could use the OperatingSystemMXBean.getProcessCpuTime() method, but its documentation says: "Returns the CPU time used by the process on which the Java virtual machine is running in nanoseconds." I am not entirely clear if this is what I want.
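Since each task does run in its own child JVM, a call to getProcessCpuTime() made inside that JVM measures only that task's process, which is the per-task CPU time (not wall-clock time). A minimal sketch of how the call is made; note that the method lives on the com.sun.management subinterface, so a cast is required, and the class name TaskCpuTime is just for illustration:

```java
import java.lang.management.ManagementFactory;

public class TaskCpuTime {
    public static void main(String[] args) {
        // Burn some CPU so the counter visibly advances.
        long sink = 0;
        for (int i = 0; i < 50_000_000; i++) sink += i;

        java.lang.management.OperatingSystemMXBean os =
                ManagementFactory.getOperatingSystemMXBean();

        // getProcessCpuTime() is defined on com.sun.management.OperatingSystemMXBean,
        // which HotSpot/OpenJDK JVMs return from the factory above; other JVMs may not.
        if (os instanceof com.sun.management.OperatingSystemMXBean) {
            long cpuNanos =
                    ((com.sun.management.OperatingSystemMXBean) os).getProcessCpuTime();
            System.out.println("JVM CPU time so far: " + cpuNanos + " ns (sink=" + sink + ")");
        }
    }
}
```

The returned value covers all threads of the JVM process (user + system time), and is -1 if the platform does not support the measurement.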
Answer 1:
I am using a library that records resource metrics such as CPU usage/idle time, swap usage, and memory usage:
http://code.google.com/p/hadoop-toolkit/
You have to extract a patch and apply it to the Hadoop 0.20.2 tag.
As for "I am not entirely clear if this is what I want" — I am pretty sure that this method returns the wall-clock time as well.
Answer 2:
Just for posterity: I solved this problem by making a change in src/mapred/org/apache/hadoop/mapred/TaskLog.java (Hadoop 0.20.203) on line 572:
mergedCmd.append("exec setsid 'time' "); // add 'time'
The CPU time will then be written to logs/userlogs/JOBID/TASKID/stderr. I also wrote a script that reaps the cumulative CPU time: https://gist.github.com/1984365
Before running the job, you need to make sure you do:
rm -rf logs/userlogs/*
so that the script works.
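The gist itself is not reproduced here, but the parsing it needs can be sketched as follows. This is a hypothetical helper (class name TimeLogParser is mine), assuming the `time` prepended above writes POSIX-style "real/user/sys 0m1.23s" lines to each task's stderr; the exact format depends on which `time` implementation the shell resolves. The task's CPU time is user + sys, not real:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TimeLogParser {
    // Matches durations like "0m1.23s" in `time` output lines such as "user\t0m1.23s".
    private static final Pattern DURATION = Pattern.compile("(\\d+)m([\\d.]+)s");

    // Converts one output line to seconds, e.g. "user\t1m2.50s" -> 62.5.
    static double seconds(String line) {
        Matcher m = DURATION.matcher(line);
        return m.find()
                ? Integer.parseInt(m.group(1)) * 60 + Double.parseDouble(m.group(2))
                : 0.0;
    }

    // Sums user + sys (the actual CPU time) from one task's stderr text.
    static double cpuSeconds(String stderr) {
        double total = 0.0;
        for (String line : stderr.split("\n")) {
            if (line.startsWith("user") || line.startsWith("sys")) {
                total += seconds(line);
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Sample of what `time` typically writes to a task's stderr.
        String sample = "real\t0m2.05s\nuser\t1m1.20s\nsys\t0m0.30s\n";
        System.out.println("CPU seconds: " + cpuSeconds(sample));
    }
}
```

Summing this value over every logs/userlogs/JOBID/TASKID/stderr file gives the job's cumulative CPU time, which is what the gist's script does.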
Source: https://stackoverflow.com/questions/9365812/how-to-find-the-cpu-time-taken-by-a-map-reduce-task-in-hadoop