There are 0 datanode(s) running and no node(s) are excluded in this operation

Anonymous (unverified), submitted on 2019-12-03 02:13:02

Question:

I have set up a multi-node Hadoop cluster. The NameNode and SecondaryNameNode run on the same machine, and the cluster has only one DataNode. All the nodes run on Amazon EC2 machines.

The configuration files on the master node are as follows:

masters

54.68.218.192 (public IP of the master node)

slaves

54.68.169.62 (public IP of the slave node)

core-site.xml

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

mapred-site.xml

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

hdfs-site.xml

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/usr/local/hadoop_store/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.name.dir</name>
    <value>file:/usr/local/hadoop_store/hdfs/datanode</value>
  </property>
</configuration>

The configuration files on the datanode are as follows:

core-site.xml

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://54.68.218.192:10001</value>
  </property>
</configuration>

mapred-site.xml

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>54.68.218.192:10002</value>
  </property>
</configuration>

hdfs-site.xml

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/usr/local/hadoop_store/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.name.dir</name>
    <value>file:/usr/local/hadoop_store/hdfs/datanode</value>
  </property>
</configuration>

Running jps on the NameNode gives the following:

5696 NameNode
6504 Jps
5905 SecondaryNameNode
6040 ResourceManager

and jps on the datanode gives:

2883 DataNode
3496 Jps
3381 NodeManager

which to me seems right.

Now when I try to run a put command:

hadoop fs -put count_inputfile /test/input/

it gives me the following error:

put: File /count_inputfile.COPYING could only be replicated to 0 nodes instead of minReplication (=1). There are 0 datanode(s) running and no node(s) are excluded in this operation.

The logs on the datanode say the following.

hadoop-datanode log:

INFO org.apache.hadoop.ipc.Client: Retrying connect to server: 54.68.218.192/54.68.218.192:10001. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)

yarn-nodemanager log:

INFO org.apache.hadoop.ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8031. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)

The NameNode web UI (port 50070) shows that there are 0 live nodes and 0 dead nodes, and that DFS used is 100%.

I have also disabled IPv6. On a few websites I found that I should also edit the /etc/hosts file. I have edited them as well, and they look like this:

127.0.0.1 localhost
172.31.25.151 ip-172-31-25-151.us-west-2.compute.internal
172.31.25.152 ip-172-31-25-152.us-west-2.compute.internal

Why am I still getting the error?

Answer 1:

Two things worked for me.

STEP 1 : stop hadoop and clean temp files from hduser

sudo rm -R /tmp/* 

You may also need to delete and recreate /app/hadoop/tmp (for me, mostly when changing the Hadoop version from 2.2.0 to 2.7.0):

sudo rm -r /app/hadoop/tmp
sudo mkdir -p /app/hadoop/tmp
sudo chown hduser:hadoop /app/hadoop/tmp
sudo chmod 750 /app/hadoop/tmp

STEP 2: format namenode

hdfs namenode -format 

Now I can see the DataNode:

hduser@prayagupd:~$ jps
19135 NameNode
20497 Jps
19477 DataNode
20447 NodeManager
19902 SecondaryNameNode
20106 ResourceManager


Answer 2:

I had the same problem after an improper shutdown of the node. I also checked the web UI, and the datanode was not listed.

It is working now, after I deleted the files from the datanode folder and restarted the services:

stop-all.sh

rm -rf /usr/local/hadoop_store/hdfs/datanode/*

start-all.sh



Answer 3:

@Learner,
I had this problem of datanodes not being shown in the NameNode's web UI. I solved it with these steps on Hadoop 2.4.1.

Do this for all the nodes (master and slaves):

1. Remove all temporary files (by default in /tmp): sudo rm -R /tmp/*
2. Try connecting to all nodes through ssh using ssh username@host, and run ssh-copy-id -i ~/.ssh/id_rsa.pub username@host on the master to give it passwordless access to the slaves (not doing so might be the reason connections are refused).
3. Format the namenode using hadoop namenode -format and try restarting the daemons.



Answer 4:

In my situation, the firewalld service was running with its default configuration, which does not allow communication between the nodes. My Hadoop cluster was a test cluster, so I stopped the service. If your servers are in production, you should allow the Hadoop ports in firewalld instead of running:

service firewalld stop
chkconfig firewalld off
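As a rough sketch of the production alternative, the ports can be opened with firewall-cmd rather than stopping firewalld. The helper below only prints the commands so they can be reviewed first. The function name and the port list are assumptions: 9000 and 8031 appear elsewhere in this thread, 50070 is the web UI port from the question, and 50010 is the usual Hadoop 2.x DataNode transfer port. Substitute the ports your cluster actually uses.

```shell
#!/bin/sh
# Print (do not run) the firewall-cmd invocations that would permanently
# open the given TCP ports, followed by a reload to apply them.
# print_firewalld_rules is a hypothetical helper name.
print_firewalld_rules() {
  for port in "$@"; do
    printf 'firewall-cmd --permanent --add-port=%s/tcp\n' "$port"
  done
  printf 'firewall-cmd --reload\n'
}

# Assumed port list; replace with the ports from your own configuration.
print_firewalld_rules 9000 50010 50070 8031
```

After checking the output, the printed commands can be run as root on each node.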


Answer 5:

It is probably because the cluster ID of the datanodes does not match that of the namenode. The cluster ID can be seen in the VERSION file found on both the namenode and the datanodes.

This happens when you format your namenode and then restart the cluster, but the datanodes still try to connect using the previous clusterID. To connect successfully, the datanodes need the correct IP address and also a cluster ID that matches the namenode's.
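A quick way to confirm the mismatch is to compare the clusterID lines of the two VERSION files. This is a minimal sketch: the function names are hypothetical, and it assumes the VERSION file (normally in the current/ subdirectory of the storage directory) contains a line of the form clusterID=CID-..., and that one of the two files has been copied over so both are readable on the same machine.

```shell
#!/bin/sh
# Print the clusterID recorded in an HDFS storage VERSION file.
# Assumes a properties-style line such as: clusterID=CID-...
cluster_id() {
  sed -n 's/^clusterID=//p' "$1"
}

# Succeed only if both files are readable and record the same,
# non-empty clusterID.
same_cluster() {
  nn_id=$(cluster_id "$1") || return 1
  dn_id=$(cluster_id "$2") || return 1
  [ -n "$nn_id" ] && [ "$nn_id" = "$dn_id" ]
}

# Example (hypothetical file names):
#   same_cluster ./namenode.VERSION ./datanode.VERSION \
#     && echo "cluster IDs match" || echo "cluster IDs differ"
```

If the IDs differ, the datanode is still holding the ID from before the namenode was formatted.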

So try reformatting the namenode and datanodes or just configure the datanodes and namenode on newly created folders.

That should solve your problem.

Deleting the files from the current datanode folder also removes the old VERSION file, so the datanode gets a new VERSION file when it reconnects to the namenode.

For example, if your datanode directory in the configuration is /hadoop2/datanode:

$ rm -rvf /hadoop2/datanode/* 

Then restart the services. If you do reformat your namenode, do it before this step. Each time you reformat your namenode it gets a new, randomly generated ID, which will not match the old ID on your datanodes.

So follow this sequence every time. If you format the namenode, then:

1. Delete the contents of the datanode directory, OR configure the datanode on a newly created directory.
2. Then start your namenode and the datanodes.



Answer 6:

I had the same error. I did not have permission on the HDFS storage directories, so I gave permission to my user:

chmod 777 /usr/local/hadoop_store/hdfs/namenode
chmod 777 /usr/local/hadoop_store/hdfs/datanode


Answer 7:

The value of the fs.default.name property in core-site.xml, on both the master and the slave machines, must point to the master machine. So it will be something like this:

<property>
  <name>fs.default.name</name>
  <value>hdfs://master:9000</value>
</property>

where master is the hostname in /etc/hosts file pointing to the master node.
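To check what a node is actually configured with, the value can be pulled out of core-site.xml and tested for a loopback address. This is only a sketch with hypothetical helper names. It uses sed rather than a real XML parser, so it assumes a simple file with no comments or attributes around the property. On a working installation, hdfs getconf -confKey fs.defaultFS should report the effective value (fs.default.name is the older, deprecated name for fs.defaultFS).

```shell
#!/bin/sh
# Print the <value> of a named property from a Hadoop *-site.xml file.
# $1 = path to the XML file, $2 = property name.
# Newlines are removed first so that the match also works when <name>
# and <value> sit on separate lines.
get_property() {
  tr -d '\n' < "$1" \
    | sed -n "s|.*<name>$2</name>[[:space:]]*<value>\([^<]*\)</value>.*|\1|p"
}

# Succeed if the URI points at the local machine only, which remote
# datanodes cannot reach.
is_loopback_uri() {
  case "$1" in
    hdfs://localhost|hdfs://localhost:*|hdfs://127.*|hdfs://0.0.0.0*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example (assumes HADOOP_CONF_DIR is set on the node):
#   uri=$(get_property "$HADOOP_CONF_DIR/core-site.xml" fs.default.name)
#   is_loopback_uri "$uri" && echo "WARNING: $uri is a loopback address"
```

In the question, the master uses hdfs://localhost:9000 while the datanode uses hdfs://54.68.218.192:10001, so the two files disagree on both the host and the port.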



Answer 8:

1) Stop all services first using the command stop-all.sh

2) Delete all files inside the datanode directory: rm -rf /usr/local/hadoop_store/hdfs/datanode/*

3) Then start all services using the command start-all.sh

You can check whether all of your services are running using the jps command.

Hope this works!
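The jps check can also be scripted. A minimal sketch, assuming jps prints one "<pid> <name>" pair per line as in the listings in this thread; has_daemon is a hypothetical helper name.

```shell
#!/bin/sh
# Read jps output on stdin and succeed if a process whose name is exactly
# the given daemon name is listed (second whitespace-separated field).
has_daemon() {
  awk -v name="$1" '$2 == name { found = 1 } END { exit(found ? 0 : 1) }'
}

# Example, run on the slave node:
#   jps | has_daemon DataNode && echo "DataNode is running" \
#                             || echo "DataNode is NOT running"
```

Note that a running DataNode process is not enough on its own: in the question, jps showed the DataNode, but it still could not reach the NameNode.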


