What exactly does "Non DFS Used" mean?

庸人自扰 2020-12-13 06:56

This is what I saw on the Web UI recently:

 Configured Capacity     :   232.5 GB
 DFS Used    :   112.44 GB
 Non DFS Used    :   119.46 GB
 DFS Remaining   :   61         


        
5 answers
  •  萌比男神i
    2020-12-13 07:27

    "Non DFS Used" is calculated by the following formula:

    Non DFS Used = Configured Capacity - Remaining Space - DFS Used

    It is still confusing, at least for me.

    Because Configured Capacity = Total Disk Space - Reserved Space.

    So Non DFS Used = (Total Disk Space - Reserved Space) - Remaining Space - DFS Used

    Let's take an example. Assume I have a 100 GB disk and I set the reserved space (dfs.datanode.du.reserved) to 30 GB.

    On the disk, the system and other files use 40 GB, and DFS Used is 10 GB. If you run df -h, you will see that the available space on that disk volume is 50 GB.

    In HDFS web UI, it will show

    Non DFS Used = 100 GB (Total) - 30 GB (Reserved) - 10 GB (DFS Used) - 50 GB (Remaining) = 10 GB

    So it actually means that you initially configured 30 GB to be reserved for non-DFS usage and 70 GB for HDFS. However, it turns out that non-DFS usage exceeded the 30 GB reservation and ate up 10 GB of space that should belong to HDFS!
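The arithmetic in this example can be checked with a short Python sketch. The function and parameter names here are hypothetical and chosen for illustration; they are not part of any Hadoop API, and the code only restates the formula quoted in this answer.

```python
def configured_capacity(total_disk_gb, reserved_gb):
    """Configured Capacity = Total Disk Space - Reserved Space."""
    return total_disk_gb - reserved_gb


def non_dfs_used(total_disk_gb, reserved_gb, dfs_used_gb, remaining_gb):
    """Non DFS Used = Configured Capacity - Remaining Space - DFS Used."""
    capacity = configured_capacity(total_disk_gb, reserved_gb)
    return capacity - remaining_gb - dfs_used_gb


# Values from the example: 100 GB disk, 30 GB reserved
# (dfs.datanode.du.reserved), 10 GB DFS Used, 50 GB available per df -h.
print(configured_capacity(100, 30))   # 70
print(non_dfs_used(100, 30, 10, 50))  # 10
```

Under these assumptions, the result is the part of the 70 GB configured capacity that non-DFS files occupy, not the total space used by non-DFS files (which is 40 GB in the example).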

    The term "Non DFS Used" should really be renamed to something like "how much of the configured DFS capacity is occupied by non-DFS use".

    And one should stop trying to figure out inside Hadoop why the non-DFS usage is so high.

    One useful command is lsof | grep delete, which helps you identify open files that have already been deleted. Sometimes Hadoop processes (like hive, yarn, mapred, and hdfs) hold references to these deleted files, and those references still occupy disk space.

    Also, du -hsx * | sort -rh | head -10 helps list the ten largest folders.
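For hosts where a script is more convenient than the shell pipeline, the same idea can be sketched in Python. This is a rough approximation rather than a replacement for du: the helper names `dir_size` and `largest_entries` are hypothetical, and the code sums apparent file sizes (`st_size`), whereas du reports allocated disk blocks, so the numbers can differ for sparse or very small files.

```python
import os


def dir_size(path):
    """Sum apparent file sizes under path without leaving its filesystem."""
    root_dev = os.lstat(path).st_dev
    total = 0
    for dirpath, dirnames, filenames in os.walk(path):
        # Prune subdirectories on other devices, similar to du -x.
        kept = []
        for d in dirnames:
            try:
                if os.lstat(os.path.join(dirpath, d)).st_dev == root_dev:
                    kept.append(d)
            except OSError:
                pass  # directory vanished or is unreadable; skip it
        dirnames[:] = kept
        for name in filenames:
            try:
                total += os.lstat(os.path.join(dirpath, name)).st_size
            except OSError:
                pass  # file vanished or is unreadable; skip it
    return total


def largest_entries(base, n=10):
    """Return the n largest entries directly under base as (bytes, name)."""
    sizes = []
    for entry in os.scandir(base):
        if entry.is_dir(follow_symlinks=False):
            size = dir_size(entry.path)
        else:
            size = entry.stat(follow_symlinks=False).st_size
        sizes.append((size, entry.name))
    return sorted(sizes, reverse=True)[:n]


if __name__ == "__main__":
    for size, name in largest_entries("."):
        print(f"{size / 1024 ** 2:10.1f} MiB  {name}")
```

Run it from the mount point of the DataNode volume you are investigating; like the shell version, it only shows files that still have a directory entry, so deleted-but-open files will not appear.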
