Why does submitting a job to MapReduce take so much time in general?

南方客 2020-12-19 07:19

So usually, on a 20-node cluster, submitting a job to process 3 GB of data (200 splits) takes about 30 seconds, and the actual execution takes about 1 minute. I want to understand what the bottleneck is.

3 Answers
  •  轮回少年
    2020-12-19 07:33

    As far as I know, there is no single bottleneck that causes the job-run latency; if there were, it would have been solved a long time ago.

    There are a number of steps that take time, and there are reasons why the process is slow. I will try to list them and give estimates where I can:

    1. Running the Hadoop client. It is a Java program, so roughly 1 second of JVM startup overhead can be assumed.
    2. Putting the job into the queue and letting the current scheduler run it. I am not sure what the overhead is, but because of the asynchronous nature of this process, some latency should be expected.
    3. Calculating the input splits.
    4. Running and synchronizing tasks. Here we face the fact that TaskTrackers poll the JobTracker, and not the other way around. I think this is done for the sake of scalability. It means that when the JobTracker wants to execute some task, it does not call the TaskTracker, but waits for the appropriate tracker to ping it to get the task. TaskTrackers cannot ping the JobTracker too frequently, otherwise they would overwhelm it in large clusters.
    5. Launching tasks. Without JVM reuse this takes about 3 seconds per task; with reuse, the overhead is about 1 second per task (see the sketch after this list).
    6. The client polls the JobTracker for the results (at least I think so), and this also adds some latency before it learns that the job is finished.
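
    To make the list above concrete, here is a minimal sketch of a classic MRv1-style job driver, with comments marking where steps 1, 3, 5 and 6 show up. This is an illustration under assumptions, not code from the question: the JobLatencyDemo class name, the identity Mapper/Reducer and the input/output paths are placeholders, and the mapred.job.reuse.jvm.num.tasks property is the old Hadoop 1.x (JobTracker/TaskTracker) knob for JVM reuse.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class JobLatencyDemo {
    public static void main(String[] args) throws Exception {
        // Step 1: just reaching this point already costs roughly a second of
        // JVM/client startup.
        Configuration conf = new Configuration();

        // Step 5: ask the framework to reuse task JVMs (Hadoop 1.x / MRv1 property);
        // -1 means "reuse without limit", which cuts per-task startup overhead.
        conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);

        Job job = new Job(conf, "latency-demo");
        job.setJarByClass(JobLatencyDemo.class);

        // Identity mapper/reducer as placeholders; a real job plugs its own classes in.
        job.setMapperClass(Mapper.class);
        job.setReducerClass(Reducer.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        // Step 3: input splits are computed over this path as part of submission;
        // e.g. 3 GB of input can come out as ~200 splits depending on block/split size.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        long start = System.currentTimeMillis();
        job.submit();  // Steps 1-3: client-side submission work ends here.
        long submitted = System.currentTimeMillis();

        // Steps 4-6: tasks are handed out via TaskTracker heartbeats, and the client
        // polls the JobTracker for progress and completion inside waitForCompletion().
        boolean ok = job.waitForCompletion(true);
        long finished = System.currentTimeMillis();

        System.out.printf("submission: %d ms, execution + polling: %d ms%n",
                submitted - start, finished - submitted);
        System.exit(ok ? 0 : 1);
    }
}
```

    Timing submit() and waitForCompletion() separately like this gives a rough split between the client-side submission cost (steps 1-3) and the scheduling, heartbeat and task-startup latency that follows (steps 4-6).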
