Spark lists all leaf nodes even in partitioned data


盖世英雄少女心 2020-12-01 16:31

I have parquet data partitioned by date & hour, folder structure:

events_v3
  -- event_date=2015-01-01
    -- event_hour=2015-0


      
      
        
2 answers

忘掉有多难 (original poster) 2020-12-01 17:04
              

            
            
                        
To clarify Gaurav's answer: that code snippet is from Hadoop branch-2. It is probably not going to surface until Hadoop 2.9 (see HADOOP-13208), and someone needs to update Spark to use that feature (which won't harm code using HDFS; it just won't show any speedup there).

One thing to consider is what makes a good file layout for object stores:


- Don't have deep directory trees with only a few files per directory.
- Do have shallow trees with many files.
- Consider using the first few characters of a file name for the most frequently changing value (such as day/hour), rather than the last. Why? Some object stores appear to use the leading characters for their hashing, not the trailing ones. If you give your names more uniqueness up front, they get spread out over more servers, with better bandwidth and less risk of throttling.
- If you are using the Hadoop 2.7 libraries, switch to s3a:// over s3n://. It's already faster, and getting better every week, at least in the ASF source tree.
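
As a rough sketch of the naming advice above, the Python helper below builds a flat object key in which the hour and day lead the file name instead of trailing it. The function name `object_key`, the bucket `my-bucket`, and the `-part-NNNNN.parquet` suffix are all made up for illustration. Note also that a flat layout like this gives up the `event_date=...` style directories that Spark's partition discovery relies on, so treat it as an illustration of the idea rather than a drop-in layout.

```python
from datetime import datetime


def object_key(bucket: str, table: str, ts: datetime, part: int) -> str:
    """Return an s3a:// URI whose file name starts with the fastest-changing value.

    All names here are hypothetical; adapt them to your own dataset.
    """
    # Hour first, then day: the leading characters vary the most between files.
    leading = ts.strftime("%H-%Y%m%d")
    # One shallow directory per table, with many files inside it.
    return f"s3a://{bucket}/{table}/{leading}-part-{part:05d}.parquet"


if __name__ == "__main__":
    print(object_key("my-bucket", "events_v3", datetime(2015, 1, 1, 3), 7))
```

Whether the leading characters actually influence how a given object store spreads its data is something the answer only describes as apparent behaviour, so it is worth measuring against your own store before reorganising anything.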


Finally, Apache Hadoop, Apache Spark and related projects are all open source. Contributions are welcome. That's not just the code, it's documentation, testing, and, for this performance stuff, testing against your actual datasets. Even giving us details about what causes problems (and your dataset layouts) is interesting.
    
             
                                                        
            
            
              
                