Reading DataFrame from partitioned parquet file


How can I read a partitioned Parquet dataset into a DataFrame with a condition?

This works fine:

val dataframe = sqlContext.read.parquet("file:///home/msoproj/dev_data


                      
3 Answers

天命终不由人
2020-12-04 18:46

You need to set the mergeSchema option to true, as shown below (this is from Spark 1.6.0):

val dataframe = sqlContext.read.option("mergeSchema", "true").parquet("file:///your/path/data=jDD")


This reads all the Parquet files under that path into the DataFrame and also creates the columns year, month and day in it.

Ref: https://spark.apache.org/docs/1.6.0/sql-programming-guide.html#schema-merging
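
Once the partition directories show up as columns, the condition from the question can be written as an ordinary filter on them. Below is a minimal sketch, assuming Spark 2.x or later (it uses SparkSession rather than the sqlContext from the question) and a tiny made-up dataset written to a temporary directory; the object and method names are hypothetical. With Spark 1.6 the same filter should apply to the result of sqlContext.read.parquet.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object PartitionFilterSketch {
  // Reads the whole partitioned dataset and keeps only the requested days.
  // year, month and day are columns derived from the directory names.
  def readDays(spark: SparkSession, root: String, days: Seq[Int]): DataFrame =
    spark.read.parquet(root)
      .filter(col("year") === 2015 && col("month") === 10 && col("day").isin(days: _*))

  // Writes a small illustrative dataset laid out as year=/month=/day= directories
  // and returns its root path. The data is invented for this sketch.
  def writeSample(spark: SparkSession): String = {
    import spark.implicits._
    val root = Files.createTempDirectory("parquet-sketch").resolve("data").toString
    Seq((2015, 10, 5, "a"), (2015, 10, 6, "b"), (2015, 10, 7, "c"))
      .toDF("year", "month", "day", "value")
      .write.partitionBy("year", "month", "day").parquet(root)
    root
  }
}
```

Because the filter only touches partition columns, Spark can usually skip the directories that do not match instead of scanning them.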
                                                                        
                                                        
            
            
              
                
感情败类
2020-12-04 18:47

sqlContext.read.parquet can take multiple paths as input. If you want just day=5 and day=6, you can simply add two paths like:

val dataframe = sqlContext
      .read.parquet("file:///your/path/data=jDD/year=2015/month=10/day=5/", 
                    "file:///your/path/data=jDD/year=2015/month=10/day=6/")


If you have folders under day=X, like say country=XX, country will automatically be added as a column in the dataframe.


EDIT: As of Spark 1.6, you need to provide the "basePath" option for Spark to generate the partition columns automatically. In Spark 1.6.x the above has to be rewritten like this to create a DataFrame with the columns "data", "year", "month" and "day":


val dataframe = sqlContext
     .read
     .option("basePath", "file:///your/path/")
     .parquet("file:///your/path/data=jDD/year=2015/month=10/day=5/", 
                    "file:///your/path/data=jDD/year=2015/month=10/day=6/")
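
If the list of days grows, writing each path by hand gets tedious. Here is a minimal sketch of a helper that builds the paths; DayPaths and dayPaths are hypothetical names, and the layout (year=/month=/day= under a root without a trailing slash) is assumed from the example above.

```scala
object DayPaths {
  // Builds one directory path per requested day under a partitioned root.
  // `root` is expected without a trailing slash, e.g. "file:///your/path/data=jDD".
  def dayPaths(root: String, year: Int, month: Int, days: Seq[Int]): Seq[String] =
    days.map(d => s"$root/year=$year/month=$month/day=$d/")
}

// Possible usage with the reader from the answer (requires a Spark sqlContext):
// val paths = DayPaths.dayPaths("file:///your/path/data=jDD", 2015, 10, 5 to 6)
// val dataframe = sqlContext.read
//   .option("basePath", "file:///your/path/")
//   .parquet(paths: _*)
```

The `paths: _*` syntax passes the sequence as the varargs that parquet accepts.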

                                                                        
                                                        
            
            
              
                
广开言路
2020-12-04 18:48

If you want to read multiple days, for example day=5 and day=6, and specify them in the path itself, you can use glob patterns:

val dataframe = sqlContext
  .read
  .parquet("file:///your/path/data=jDD/year=2015/month=10/day={5,6}/*")


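A bracket expression can also be used to specify a range of single-digit days:

val dataframe = sqlContext
  .read
  .parquet("file:///your/path/data=jDD/year=2015/month=10/day=[5-9]/*")


This matches days 5 through 9. A bracket expression matches exactly one character, so a pattern such as day=[5-10] is not a numeric range and does not match days 5 to 10. To include day 10, list the values explicitly, for example day={5,6,7,8,9,10}.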
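
Since bracket expressions in glob patterns match a single character, a range that includes multi-digit days is safer to write as an explicit brace list. A minimal sketch that generates such a pattern from a Scala range follows; DayGlob, dayGlob and path are hypothetical names, and the directory layout is assumed from the examples above.

```scala
object DayGlob {
  // Expands a sequence of days into a brace alternation such as {5,6,7,8,9,10}.
  def dayGlob(days: Seq[Int]): String = days.mkString("{", ",", "}")

  // Builds the full glob path for the given days under a partitioned root.
  def path(root: String, year: Int, month: Int, days: Seq[Int]): String =
    s"$root/year=$year/month=$month/day=${dayGlob(days)}/*"
}

// Possible usage (requires a Spark sqlContext):
// val dataframe = sqlContext.read
//   .parquet(DayGlob.path("file:///your/path/data=jDD", 2015, 10, 5 to 10))
```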
                                                                        
                                                        
            
            
              
                