Multiple spark jobs appending parquet data to same base path with partitioning

粉色の甜心 2020-12-08 01:07

I have multiple jobs that I want to execute in parallel, each appending daily data into the same base path using partitioning.

e.g.

dataFrame.write()
    .partitionBy("eventDate", "channel")
    .mode(SaveMode.Append)
    .parquet("s3://bucket/save/path");

4 Answers
  •  执笔经年 2020-12-08 01:26

    Instead of using partitionBy

    dataFrame.write()
        .partitionBy("eventDate", "channel")
        .mode(SaveMode.Append)
        .parquet("s3://bucket/save/path");

    Alternatively, you can write the files directly into their partition directories.

    In job-1, specify the parquet file path as:

    dataFrame.write()
        .mode(SaveMode.Append)
        .parquet("s3://bucket/save/path/eventDate=20160101/channel=billing_events");

    and in job-2, specify the parquet file path as:

    dataFrame.write()
        .mode(SaveMode.Append)
        .parquet("s3://bucket/save/path/eventDate=20160101/channel=click_events");
    
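    Putting the two snippets together, the sketch below shows what one such job could look like end to end. It is only a minimal illustration: the class name, the way eventDate/channel/source path arrive as arguments, and the SparkSession setup are assumptions, while the write call mirrors the job-1/job-2 snippets above.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SaveMode;
    import org.apache.spark.sql.SparkSession;

    public class AppendDailyEvents {
        public static void main(String[] args) {
            // Illustrative arguments: job-1 would pass channel=billing_events,
            // job-2 would pass channel=click_events.
            String eventDate = args[0];   // e.g. "20160101"
            String channel = args[1];     // e.g. "billing_events"
            String sourcePath = args[2];  // wherever the day's data comes from (assumption)

            SparkSession spark = SparkSession.builder()
                    .appName("append-daily-events-" + channel)  // hypothetical app name
                    .getOrCreate();

            Dataset<Row> dataFrame = spark.read().parquet(sourcePath);

            // Write straight into the partition directory instead of using partitionBy,
            // so each concurrent job gets its own _temporary directory under its own folder.
            dataFrame.write()
                    .mode(SaveMode.Append)
                    .parquet("s3://bucket/save/path/eventDate=" + eventDate + "/channel=" + channel);

            spark.stop();
        }
    }
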
    1. Both jobs will create separate _temporary directories under their respective folders, so the concurrency issue is solved.
    2. Partition discovery will still pick up eventDate=20160101 and the channel column when reading back (see the read-back sketch after this list).
    3. Disadvantage: even if channel=click_events does not exist in the data, a parquet file for channel=click_events will still be created.
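
    For point 2, a minimal read-back sketch is shown below. The basePath option tells Spark where the partitioned layout starts, so partition discovery turns the eventDate=... and channel=... directories into columns; the SparkSession setup and class name are assumptions, and the paths simply reuse the bucket/path from the examples above.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class ReadPartitionedEvents {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("read-partitioned-events")  // hypothetical app name
                    .getOrCreate();

            // Pointing the reader at a sub-directory while declaring the base path
            // keeps eventDate and channel available as regular columns.
            Dataset<Row> events = spark.read()
                    .option("basePath", "s3://bucket/save/path")
                    .parquet("s3://bucket/save/path/eventDate=20160101");

            // Both channels written by the parallel jobs can now be filtered on.
            events.filter("channel = 'billing_events'").show();

            spark.stop();
        }
    }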
