This is what I saw on the HDFS web UI recently:
Configured Capacity : 232.5 GB
DFS Used : 112.44 GB
Non DFS Used : 119.46 GB
DFS Remaining : 61
"Non DFS Used" is calculated by the following formula:
Non DFS Used = Configured Capacity - Remaining Space - DFS Used
It is still confusing, at least for me, because Configured Capacity = Total Disk Space - Reserved Space. So:
Non DFS Used = (Total Disk Space - Reserved Space) - Remaining Space - DFS Used
Let's take an example. Assume I have a 100 GB disk, and I set the reserved space (dfs.datanode.du.reserved) to 30 GB.
On that disk, the system and other files use 40 GB and DFS Used is 10 GB. If you run df -h, you will see the available space is 50 GB for that disk volume.
In the HDFS web UI, it will show:
Non DFS Used = 100 GB (Total) - 30 GB (Reserved) - 10 GB (DFS Used) - 50 GB (Remaining) = 10 GB
So what it actually means is: you initially configured 30 GB reserved for non-DFS usage and 70 GB for HDFS. However, it turns out that non-DFS usage exceeds the 30 GB reservation and eats up 10 GB of space that should belong to HDFS!
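The worked example above can be checked with a few lines of arithmetic (a sketch; the variable names are mine, not anything from the Hadoop source):

```python
total_disk = 100      # GB, physical volume size
dfs_reserved = 30     # GB, dfs.datanode.du.reserved
dfs_used = 10         # GB
remaining = 50        # GB, what `df -h` reports as available

configured_capacity = total_disk - dfs_reserved           # 70 GB promised to HDFS
non_dfs_used = configured_capacity - remaining - dfs_used
print(non_dfs_used)   # 10 -> non-DFS data overran its 30 GB reservation by 10 GB
```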
The term "Non DFS Used" should really be renamed to something like "how much of the configured DFS capacity is occupied by non-DFS use".
And one should stop trying to figure out, inside Hadoop, why non-DFS usage is so high.
One useful command is lsof | grep delete, which helps you identify open files that have already been deleted. Sometimes Hadoop processes (like hive, yarn, mapred, and hdfs) hold references to such deleted files, and those references keep occupying disk space.
Also, du -hsx * | sort -rh | head -10 helps list the ten largest folders.
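The same "top ten largest folders" listing can be approximated in a few lines of Python (a rough sketch of what the du one-liner does; unlike du, it does not account for hard links, sparse files, or filesystem block overhead):

```python
import os
import tempfile

def dir_size(path):
    """Total size in bytes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            fp = os.path.join(root, name)
            if os.path.isfile(fp):
                total += os.path.getsize(fp)
    return total

def top_dirs(parent, n=10):
    """Largest immediate subdirectories of parent, biggest first."""
    subdirs = [os.path.join(parent, d) for d in os.listdir(parent)
               if os.path.isdir(os.path.join(parent, d))]
    return sorted(((dir_size(d), d) for d in subdirs), reverse=True)[:n]

# Demo on a throwaway directory tree:
base = tempfile.mkdtemp()
os.makedirs(os.path.join(base, "logs"))
os.makedirs(os.path.join(base, "cache"))
with open(os.path.join(base, "logs", "app.log"), "w") as fh:
    fh.write("x" * 4000)
with open(os.path.join(base, "cache", "blob"), "w") as fh:
    fh.write("x" * 100)

for size, d in top_dirs(base):
    print(size, d)   # "logs" (4000 bytes) is listed before "cache" (100 bytes)
```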
Non DFS Used is any data in the filesystem of the data node(s) that isn't in dfs.data.dirs. This includes log files, MapReduce shuffle output, and local copies of data files (if you put them on a data node). Use du or a similar tool to see what's taking up the space in your filesystem.
The non-DFS usage also includes cache files stored by the NodeManager. You can check their location under the yarn.nodemanager.local-dirs property in yarn-site.xml.
You can refer to the default yarn-site.xml for details.
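To see where those cache files land, you can read yarn.nodemanager.local-dirs out of yarn-site.xml with the standard library (a sketch; the sample XML and the /data/yarn/local path are illustrative, not from a real cluster):

```python
import xml.etree.ElementTree as ET

# Illustrative stand-in for a real yarn-site.xml.
SAMPLE_YARN_SITE = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>yarn.nodemanager.local-dirs</name>
    <value>/data/yarn/local</value>
  </property>
</configuration>"""

def get_property(xml_text, key):
    """Return the <value> of the <property> whose <name> matches key."""
    root = ET.fromstring(xml_text)
    for prop in root.findall("property"):
        if prop.findtext("name") == key:
            return prop.findtext("value")
    return None

print(get_property(SAMPLE_YARN_SITE, "yarn.nodemanager.local-dirs"))
# /data/yarn/local
```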
The correct simplified definition is: "Any data in the same filesystem(s) as dfs.data.dirs that was not written by HDFS." In other words, if you use hdfs dfs commands to copy data, it ends up under dfs.data.dirs and is counted as "DFS usage"; if you use the regular cp command to copy files into dfs.data.dirs, it becomes "non-DFS usage".
One more thing. The example above gave:
Non DFS Used = 100 GB (Total) - 30 GB (Reserved) - 10 GB (DFS Used) - 50 GB (Remaining) = 10 GB
But ext3/ext4 reserves 5% of blocks by default (see the reserved block count), so it should really be:
Non DFS Used = 100 GB (Total) - 30 GB (Reserved by app) - 5 GB (Reserved by FS) - 10 GB (DFS Used) - 50 GB (Remaining) = 5 GB
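The corrected arithmetic, with the filesystem's own reserved blocks broken out as a separate term (a sketch; the 5% figure is the ext3/ext4 default, and the other numbers are from the example above):

```python
total_disk = 100                  # GB
dfs_reserved = 30                 # GB, dfs.datanode.du.reserved
fs_reserved = total_disk * 0.05   # GB, ext3/ext4 default reserved blocks (5%)
dfs_used = 10                     # GB
remaining = 50                    # GB

non_dfs_used = total_disk - dfs_reserved - fs_reserved - dfs_used - remaining
print(non_dfs_used)   # 5.0
```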
You can get the "Reserved block count" from sudo tune2fs -l /dev/sdm1.
By the way, sudo tune2fs -m 0.2 /dev/sdm1 tunes the reserved space down to 0.2%.
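To turn "Reserved block count" into actual space, multiply it by the block size (both values appear in the tune2fs -l output); a sketch with illustrative numbers, not output from a real device:

```python
reserved_block_count = 1_310_720   # from `tune2fs -l`, illustrative value
block_size = 4096                  # bytes, also from `tune2fs -l`

reserved_bytes = reserved_block_count * block_size
print(reserved_bytes / 1024**3, "GiB")   # 5.0 GiB
```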