Hadoop client and cluster separation

Submitted by 时光毁灭记忆、已成空白 on 2019-12-04 21:43:53

Users shouldn't be able to disrupt the functionality of the cluster; that is the point of the separation. Imagine a whole group of data scientists launching their jobs from one of the cluster's master nodes. If someone runs a memory-intensive operation, the master processes running on that same machine could run out of memory and crash, leaving the whole cluster in a failed state.

If you separate the client node from the master/slave nodes, users could still crash the client, but the cluster would stay up.

First of all, this link has detailed information on how the client communicates with the namenode:

http://www.informit.com/articles/article.aspx?p=2460260&seqNum=2

To my understanding, your professor wants a separate client node from which you can run Hadoop jobs, but that node should not be part of the Hadoop cluster itself.

Consider a scenario where you have to submit a Hadoop job from a client machine that is not part of the existing Hadoop cluster, and the job is expected to execute on that cluster.

The Namenode and Datanodes form the Hadoop cluster, and the client submits jobs to the Namenode. To achieve this, the client must have the same copy of the Hadoop distribution and configuration that is present on the Namenode. Only then will the client know which node the JobTracker is running on, and the IP of the Namenode for accessing HDFS data.

Go through the configuration on the Namenode:

core-site.xml will have this property:

<property>
        <name>fs.default.name</name>
        <value>192.168.0.1:9000</value>
</property> 

mapred-site.xml will have this property:

<property>
      <name>mapred.job.tracker</name>
      <value>192.168.0.1:8021</value>
 </property>

These two important properties must be copied to the client machine's Hadoop configuration. You also need to set one additional property in mapred-site.xml to avoid a PriviledgedActionException:

<property>
      <name>mapreduce.jobtracker.staging.root.dir</name>
      <value>/user</value>
</property>
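
Note that these `<property>` fragments must live inside a `<configuration>` root element. A minimal sketch of assembling the client's copies by hand, using the Namenode values shown above (the `/tmp/hadoop-client-conf` path is illustrative; in practice these files go in the client's Hadoop conf directory):

```shell
# Assemble the client-side config files with the same values as the Namenode.
CONF_DIR=/tmp/hadoop-client-conf
mkdir -p "$CONF_DIR"

# core-site.xml: tells the client where the Namenode (HDFS) is.
cat > "$CONF_DIR/core-site.xml" <<'EOF'
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>192.168.0.1:9000</value>
  </property>
</configuration>
EOF

# mapred-site.xml: tells the client where the JobTracker is, plus the
# staging-dir property that avoids the PriviledgedActionException.
cat > "$CONF_DIR/mapred-site.xml" <<'EOF'
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>192.168.0.1:8021</value>
  </property>
  <property>
    <name>mapreduce.jobtracker.staging.root.dir</name>
    <value>/user</value>
  </property>
</configuration>
EOF
```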

You also need to update /etc/hosts on the client machine with the IP addresses and hostnames of the Namenode and Datanodes.
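
For example, the client's /etc/hosts might gain entries like the following (the hostnames here are hypothetical; use your cluster's actual IPs and hostnames):

```
192.168.0.1   namenode
192.168.0.2   datanode1
192.168.0.3   datanode2
```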

Now you can submit a job from the client machine with the hadoop jar command, and the job will be executed on the Hadoop cluster. Note that you shouldn't start any Hadoop services on the client machine.
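
For example, submitting the bundled wordcount example from the client (the jar name and HDFS paths are illustrative; adjust them to your Hadoop version and data):

```
hadoop jar hadoop-examples.jar wordcount /user/input /user/output
```

The job runs on the cluster's TaskTrackers, not on the client; the client only packages the job and submits it to the JobTracker.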
