MapReduce job hangs, waiting for AM container to be allocated

天涯浪人 · asked 2020-12-15 07:00

I tried to run simple word count as MapReduce job. Everything works fine when run locally (all work done on Name Node). But, when I try to run it on a cluster using YARN (ad

9 answers
  • 2020-12-15 07:13

    I feel you are getting your memory settings wrong.

    To understand the tuning of YARN configuration, I found this to be a very good source: http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/cdh_ig_yarn_tuning.html

    I followed the instructions given in this blog and was able to get my jobs running. You should alter your settings in proportion to the physical memory available on your nodes.

    Key things to remember are:

    • Values of mapreduce.map.memory.mb and mapreduce.reduce.memory.mb should be at least yarn.scheduler.minimum-allocation-mb
    • Values of mapreduce.map.java.opts and mapreduce.reduce.java.opts should be around "0.8 times the value of" corresponding mapreduce.map.memory.mb and mapreduce.reduce.memory.mb configurations. (In my case it is 983 MB ~ (0.8 * 1228 MB))
    • Similarly, the value of yarn.app.mapreduce.am.command-opts should be "0.8 times the value of" yarn.app.mapreduce.am.resource.mb (a quick sizing sketch follows this list)
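
    If it helps to see the arithmetic, here is a minimal shell sketch of that 0.8 rule, using the same illustrative numbers as the settings below (the exact values are assumptions; scale them to your own nodes):

    NODE_MB=9830                          # yarn.nodemanager.resource.memory-mb
    MIN_ALLOC_MB=1228                     # yarn.scheduler.minimum-allocation-mb
    CONTAINER_MB=$MIN_ALLOC_MB            # mapreduce.{map,reduce}.memory.mb, at least the minimum allocation
    HEAP_MB=$(( CONTAINER_MB * 8 / 10 ))  # heap for java.opts is roughly 0.8 * container size
    echo "memory.mb=$CONTAINER_MB  java.opts=-Xmx${HEAP_MB}m"   # prints -Xmx982m (the answer rounds up to 983m)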

    Following are the settings I use and they work perfectly for me:

    yarn-site.xml:

    <property> 
        <name>yarn.scheduler.minimum-allocation-mb</name> 
        <value>1228</value>
    </property>
    <property> 
        <name>yarn.scheduler.maximum-allocation-mb</name> 
        <value>9830</value>
    </property>
    <property> 
        <name>yarn.nodemanager.resource.memory-mb</name> 
        <value>9830</value>
    </property>
    

    mapred-site.xml:

    <property>  
        <name>yarn.app.mapreduce.am.resource.mb</name>  
        <value>1228</value>
    </property>
    <property> 
        <name>yarn.app.mapreduce.am.command-opts</name> 
        <value>-Xmx983m</value>
    </property>
    <property>
        <name>mapreduce.map.memory.mb</name>
        <value>1228</value>
    </property>
    <property>
        <name>mapreduce.reduce.memory.mb</name>
        <value>1228</value>
    </property>
    <property>
        <name>mapreduce.map.java.opts</name>
        <value>-Xmx983m</value>
    </property>
    <property>
        <name>mapreduce.reduce.java.opts</name>
        <value>-Xmx983m</value>
    </property>
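
    As a quick sanity check after restarting YARN, the ResourceManager's cluster metrics REST endpoint reports how much memory it thinks the cluster has, which should line up with the values above. A small sketch, assuming the default web UI port 8088 (RM_HOST is a placeholder for your ResourceManager host):

    RM_HOST=${RM_HOST:-localhost}                            # placeholder; set to your ResourceManager host
    curl -s "http://${RM_HOST}:8088/ws/v1/cluster/metrics"   # totalMB / availableMB should match the config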
    

    You can also refer to the answer here: Yarn container understanding and tuning

    You can also add vCore settings if you want container allocation to take CPU into account as well. For that to work, you need to use the CapacityScheduler with the DominantResourceCalculator. See the discussion about this here: How are containers created based on vcores and memory in MapReduce2?

  • 2020-12-15 07:13

    The first thing to do is check the YARN ResourceManager logs. I searched the Internet about this problem for a long time, but nobody explained how to find out what is really happening. Checking the ResourceManager logs is straightforward and simple, and I am confused why people ignore them.
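
    In case it saves someone a search, here is a small sketch of where those logs usually live, assuming a plain tarball install that writes logs under $HADOOP_HOME/logs (packaged installs often put them in /var/log/hadoop-yarn instead):

    ls "$HADOOP_HOME/logs" | grep -i resourcemanager         # find the ResourceManager log file
    tail -n 200 "$HADOOP_HOME"/logs/*resourcemanager*.log    # read the most recent entries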

    For me, there was an error in the log:

    Caused by: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=172.16.0.167/172.16.0.167:55622]
    

    That's because I switched Wi-Fi networks at my workplace, so my computer's IP address changed.

  • 2020-12-15 07:17

    This solved the error in my case. yarn.scheduler.capacity.maximum-am-resource-percent limits how much of the cluster's resources may be used by ApplicationMasters (the default is 0.1); raising it lets the waiting AM get a container:

    <property>
      <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
      <value>100</value>
    </property>
    
  • 2020-12-15 07:19

    You should check the status of the NodeManagers in your cluster. If the NM nodes are short on disk space, the RM will mark them "unhealthy", and unhealthy NMs cannot allocate new containers.

    1) Check the Unhealthy nodes: http://<active_RM>:8088/cluster/nodes/unhealthy

    If the "health report" tab says "local-dirs are bad", it means you need to clean up some disk space on those nodes.

    2) Check the dfs.data.dir property in hdfs-site.xml. It points to the location on the local file system where HDFS data is stored.

    3) Log in to those machines and use the df -h and hadoop fs -du -h commands to measure the space occupied.

    4) Check the HDFS trash and empty it if it is taking up space: hadoop fs -du -h /user/user_name/.Trash and hadoop fs -rm -r /user/user_name/.Trash/* (these checks are collected as commands in the sketch after this list).
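
    A small sketch of the same checks from the command line, assuming the yarn and hadoop CLIs are on the PATH:

    yarn node -list -all                  # every NodeManager with its state, including UNHEALTHY
    df -h                                 # local disk usage on a NodeManager host
    hadoop fs -du -h /user/$USER/.Trash   # space held in the HDFS trash for the current user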

  • 2020-12-15 07:26

    Anyway, that works for me. Thank you a lot, @KaP!

    That's my yarn-site.xml:

    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>MacdeMacBook-Pro.local</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.resourcemanager.webapp.address</name>
        <value>${yarn.resourcemanager.hostname}:8088</value>
    </property>
    <property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>4096</value>
    </property>
    <property>
        <name>yarn.scheduler.minimum-allocation-mb</name>
        <value>2048</value>
    </property>
    <property>
        <name>yarn.nodemanager.vmem-pmem-ratio</name>
        <value>2.1</value>
    </property>

    That's my mapred-site.xml:

    <configuration>
        <property>
            <name>mapreduce.framework.name</name>
            <value>yarn</value>
        </property>
    </configuration>

  • 2020-12-15 07:30

    You have 512 MB of RAM on each of the instances, and all your memory configurations in yarn-site.xml and mapred-site.xml are between 500 MB and 3 GB. You will not be able to run anything on the cluster. Change everything to ~256 MB.
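
    A quick way to compare what a node actually has with what the configuration asks for is to run something like this on each instance (a sketch; it assumes HADOOP_CONF_DIR points at your Hadoop configuration directory):

    free -m                                                  # physical RAM on the instance
    grep -A1 "memory-mb" "$HADOOP_CONF_DIR"/yarn-site.xml    # YARN memory settings
    grep -A1 "memory.mb" "$HADOOP_CONF_DIR"/mapred-site.xml  # MapReduce container sizes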

    Also, your mapred-site.xml sets the framework to YARN, yet you still have a JobTracker address, which is not correct. On a multi-node cluster you need the ResourceManager-related parameters in yarn-site.xml (including the ResourceManager web address). Without those, the nodes do not know where your ResourceManager is.

    You need to revisit both of your XML files.
