How do I correctly remove nodes in Hadoop?

一生所求 2020-12-12 19:33

I'm running Hadoop 1.1.2 on a cluster with 10+ machines. I would like to nicely scale up and down, both for HDFS and MapReduce. By "nicely", I mean that I require that data not be lost, and that nodes can be taken out of service without disrupting running jobs.

3 Answers
  •  一生所求
    2020-12-12 20:11

    You should be aware that for Hadoop to perform well, it really wants to have the data available in multiple copies. By removing nodes, you reduce the chances of the data being optimally available, and you put extra stress on the cluster to ensure availability.

    I.e. by taking down a node, you force an extra copy of all of its data to be made somewhere else. So you shouldn't really be doing this just for fun, unless you use a different data management paradigm than the default configuration (= keep 3 copies in the cluster).
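    That "3 copies" comes from the HDFS replication factor. If you do want a different trade-off, this is the knob to turn; the fragment below is only an illustrative hdfs-site.xml snippet showing the default value, not a recommendation:

        <!-- hdfs-site.xml: number of copies HDFS keeps of each block (default is 3) -->
        <property>
          <name>dfs.replication</name>
          <value>3</value>
        </property>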

    And for a Hadoop cluster to perform well, you will want to actually store the data in the cluster. Otherwise, you can't really move the computation to the data, because the data isn't there yet. Much of Hadoop is about having "smart drives" that can perform computation before sending the data across the network.

    So in order to make this reasonable, you will likely need to split your cluster somehow. Have one set of nodes keep the 3 master copies of the original data, and have some "add-on" nodes that are only used for storing intermediate data and performing computations on that part. Never change the master nodes, so they don't need to redistribute your data. Shut down add-on nodes only when they are empty? But that is probably not implemented yet.
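    As for the mechanics of taking a node out without losing data, a rough sketch of the usual Hadoop 1.x decommissioning flow is below. The file paths and hostnames are placeholders; adjust them to your setup:

        <!-- hdfs-site.xml: point the NameNode at an exclude file used for decommissioning -->
        <property>
          <name>dfs.hosts.exclude</name>
          <value>/etc/hadoop/conf/dfs.exclude</value>
        </property>

        <!-- mapred-site.xml: same idea for the JobTracker -->
        <property>
          <name>mapred.hosts.exclude</name>
          <value>/etc/hadoop/conf/mapred.exclude</value>
        </property>

        # add the node to both exclude files, then tell the masters to re-read them
        echo "datanode7.example.com" >> /etc/hadoop/conf/dfs.exclude
        echo "datanode7.example.com" >> /etc/hadoop/conf/mapred.exclude
        hadoop dfsadmin -refreshNodes
        hadoop mradmin -refreshNodes

        # wait until the node is reported as "Decommissioned" before powering it off
        hadoop dfsadmin -report

    HDFS then re-replicates the node's blocks elsewhere before marking it "Decommissioned", which is exactly the extra copying described above.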
