As far as I know, there is no single bottleneck that causes the job-run latency; if there were, it would have been solved a long time ago.
There are a number of steps that take time, and there are reasons why each of them is slow. I will try to list them and give estimates where I can:
- Running the Hadoop client. It is a Java program, so roughly 1 second of JVM startup overhead can be assumed.
- Putting the job into the queue and letting the current scheduler run it. I am not sure exactly what the overhead is, but because of the asynchronous nature of the process some latency should be expected.
- Calculating splits.
- Running and synchronizing tasks. Here we face the fact that TaskTrackers poll the JobTracker, not the other way around. I think this is done for the sake of scalability. It means that when the JobTracker wants to execute some task, it does not call the TaskTracker; instead it waits until the appropriate tracker heartbeats in to pick up the work. TaskTrackers cannot ping the JobTracker too frequently, otherwise they would overwhelm it on large clusters.
- Launching tasks. Without JVM reuse this takes about 3 seconds per task; with reuse the overhead is about 1 second per task (see the JVM-reuse sketch after this list).
- The client polls the JobTracker for the result (at least I think so), which also adds some latency before you learn that the job has finished (see the polling sketch below).
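
To cut the per-task JVM startup cost mentioned above, the classic (mapred) API lets a job ask the TaskTrackers to reuse child JVMs. A minimal sketch, assuming the Hadoop 1.x `JobConf` API; the `setNumTasksToExecutePerJvm` helper and the `mapred.job.reuse.jvm.num.tasks` property belong to that API, and the actual savings will vary per cluster:

    import org.apache.hadoop.mapred.JobConf;

    public class JvmReuseExample {
        public static void main(String[] args) {
            JobConf conf = new JobConf(JvmReuseExample.class);

            // Allow an unlimited number of tasks from this job to share one child JVM.
            // (-1 = unlimited, 1 = the default "new JVM per task" behaviour.)
            conf.setNumTasksToExecutePerJvm(-1);

            // Equivalent to setting the property directly:
            // conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);
        }
    }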
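
On the client side, the blocking submit call in the classic API simply polls the JobTracker until the job reports completion, so the final bit of latency is the polling interval itself. A minimal sketch, again assuming the old `org.apache.hadoop.mapred` API (`JobClient`, `RunningJob`); the explicit loop below just makes visible what `JobClient.runJob()` does internally, and the 1-second sleep is an illustrative value, not the real interval:

    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RunningJob;

    public class PollForCompletion {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(PollForCompletion.class);
            // ... input/output paths, mapper/reducer classes, etc. would go here ...

            JobClient client = new JobClient(conf);
            RunningJob job = client.submitJob(conf);   // asynchronous submit

            // Poll the JobTracker for status until the job is done;
            // each sleep adds to the perceived end-to-end latency.
            while (!job.isComplete()) {
                Thread.sleep(1000);
            }
            System.out.println(job.isSuccessful() ? "Job succeeded" : "Job failed");
        }
    }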