What does CPU Time for a Hadoop Job signify?

為{幸葍}努か posted on 2019-11-29 00:16:16

The map phase consists of: record reader, map, combiner, and partitioner.

The reduce phase consists of: shuffle, sort, reduce, output.

The CPU time you are seeing there covers the entire map phase and the entire reduce phase, not just the map and reduce functions themselves. The terminology is confusing because the map function and the reduce function are each only one portion of the map phase and the reduce phase. It is also the total CPU time summed across all of the nodes in the cluster.
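To make the "total across all nodes" point concrete, here is a minimal sketch of the aggregation. The per-task numbers and the names `map_task_cpu_ms`, `reduce_task_cpu_ms`, and `total_cpu_ms` are made up for illustration; this is not how Hadoop computes the figure internally, just the arithmetic behind it.

```python
# Hypothetical per-task CPU times in milliseconds (made-up values).
# Each entry stands for one task, which may have run on any node.
map_task_cpu_ms = [4200, 3900, 4100]
reduce_task_cpu_ms = [2500, 2700]


def total_cpu_ms(*phases):
    """Sum CPU time over every task in every phase passed in."""
    return sum(sum(tasks) for tasks in phases)


# The job-level figure is the sum over all tasks of both phases.
job_cpu_ms = total_cpu_ms(map_task_cpu_ms, reduce_task_cpu_ms)
print(job_cpu_ms)  # 17400
```

Note that the sum does not care which node a task ran on or whether tasks overlapped in time, which is why it can be far larger than the job's elapsed time.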

CPU time is hugely different from real time. CPU time is how much time something spent executing on the CPUs, while real (wall-clock) time is what you and I experience as humans. Think about this: assume you have the same job running over the same data, first on a 20 node cluster, then on a 200 node cluster. Overall, roughly the same amount of CPU time will be used on both clusters, but the 200 node cluster will run about 10x faster in real time, assuming the job parallelizes well. CPU time is a useful metric when you have a shared system with lots of jobs running on it at the same time.
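The 20 node versus 200 node comparison can be sketched as simple arithmetic. This assumes perfect parallelism and ignores scheduling overhead, data skew, and I/O waits, so real clusters will do worse; the function name `ideal_wall_clock_seconds` and the 20 CPU-hour figure are hypothetical.

```python
def ideal_wall_clock_seconds(total_cpu_seconds, nodes, cores_per_node=1):
    """Best-case elapsed time if the CPU work splits evenly over all cores.

    Assumption: perfect parallelism, no overhead. Real jobs will be slower.
    """
    return total_cpu_seconds / (nodes * cores_per_node)


total_cpu = 72_000.0  # hypothetical job needing 20 CPU-hours in total

small = ideal_wall_clock_seconds(total_cpu, nodes=20)   # 3600.0 seconds
large = ideal_wall_clock_seconds(total_cpu, nodes=200)  # 360.0 seconds
speedup = small / large                                 # 10.0

print(small, large, speedup)
```

The CPU time (`total_cpu`) is identical in both cases; only the elapsed time changes with the number of nodes.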

I don't know how you would dive deeper to get the CPU time of each step within a phase. Timing the code with wall-clock timestamps is probably not what you are looking for, though, since that measures real time rather than CPU time.
