Using Spark to write a parquet file to s3 over s3a is very slow

闹比i 2020-12-04 18:16
I'm trying to write a parquet file out to Amazon S3 using Spark 1.6.1. The small parquet that I'm generating is

      
      
        
4 Answers

遥遥无期 (original poster), 2020-12-04 18:43

Spark's defaults cause a large amount of (probably) unnecessary overhead during I/O operations, especially when writing to S3. This article discusses this more thoroughly, but there are two settings you'll want to consider changing.


Using the DirectParquetOutputCommitter. By default, Spark saves all of the data to a temporary folder and then moves those files afterwards. Using the DirectParquetOutputCommitter saves time by writing directly to the S3 output path.


Note: the DirectParquetOutputCommitter is no longer available in Spark 2.0+.

As stated in the JIRA ticket, the current solution is to:

  - Switch your code to using s3a and Hadoop 2.7.2+; it's better all round, gets better in Hadoop 2.8, and is the basis for s3guard.
  - Use the Hadoop FileOutputCommitter and set mapreduce.fileoutputcommitter.algorithm.version to 2.
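
The two steps above could be wired up roughly as follows. This is a minimal sketch, not the answer author's code: the application name, the object name and the bucket path s3a://example-bucket/output/ are hypothetical, and it assumes the hadoop-aws jars and S3 credentials are already set up on the cluster. It relies on Spark copying any property prefixed with spark.hadoop. into the Hadoop configuration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object S3aCommitterSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("s3a-parquet-write") // hypothetical name; master is supplied by spark-submit
      // "spark.hadoop."-prefixed keys are passed through to the Hadoop Configuration.
      // Algorithm version 2 requires Hadoop 2.7 or later.
      .set("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")

    val sc = new SparkContext(conf)
    val sqx = new SQLContext(sc)

    // Confirm the setting reached the Hadoop configuration.
    println(sc.hadoopConfiguration.get("mapreduce.fileoutputcommitter.algorithm.version"))

    // Small placeholder DataFrame; the s3a:// path below is an assumed example bucket.
    val df = sqx.range(0, 1000)
    df.write.parquet("s3a://example-bucket/output/")

    sc.stop()
  }
}
```

The same property can also be passed on the command line with --conf instead of being hard-coded, which keeps the job code unchanged between clusters.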
  




Turn off schema merging. (Schema merging is turned off by default as of Spark 1.5.) If schema merging is on, the driver node will scan all of the files to ensure a consistent schema. This is especially costly because it is not a distributed operation. Make sure this is turned off by doing:


val file = sqx.read.option("mergeSchema", "false").parquet(path)
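
As an alternative to setting the option on every read, the session-wide property spark.sql.parquet.mergeSchema can be set once on the SQLContext. The sketch below assumes a Spark 1.x SQLContext named sqx as in the line above; the application name, object name and input path are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object MergeSchemaOffSketch {
  def main(args: Array[String]): Unit = {
    // Master is expected to come from spark-submit.
    val sc = new SparkContext(new SparkConf().setAppName("merge-schema-off"))
    val sqx = new SQLContext(sc)

    // Disable Parquet schema merging for all reads made through this context.
    sqx.setConf("spark.sql.parquet.mergeSchema", "false")

    val path = "s3a://example-bucket/input/" // assumed example location
    val file = sqx.read.parquet(path)
    file.printSchema()

    sc.stop()
  }
}
```

A per-read option("mergeSchema", ...) still takes precedence over the session-wide value, so individual reads can opt back in if they need merging.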

    
             
                                                        
            
            
              
                