DataFrame-ified zipWithIndex

悲哀的现实 2020-11-27 04:23
I am trying to solve the age-old problem of adding a sequence number to a data set. I am working with DataFrames, and there appears to be no DataFrame equivalent to RD

      
      
        
8 answers

        
                    
            
            
                         
                
              
              
                
一整个雨季 (original poster) 2020-11-27 04:47
              

            
            
                        
I have modified @Tagar's version to run on Python 3.7 and wanted to share it:

from pyspark.sql.types import LongType, StructField, StructType


def dfZipWithIndex(df, offset=1, colName="rowId"):
    '''
    Enumerates dataframe rows in native order, like rdd.zipWithIndex(), but on a dataframe,
    and preserves the schema.

    :param df: source dataframe
    :param offset: adjustment to zipWithIndex()'s index
    :param colName: name of the index column
    '''
    # New schema: the index column first, followed by the original fields.
    new_schema = StructType(
        [StructField(colName, LongType(), True)]
        + df.schema.fields
    )

    # zipWithIndex() pairs each row with its 0-based position: (row, index).
    zipped_rdd = df.rdd.zipWithIndex()

    # In Python 3 the (row, index) tuple arrives as a single argument, so index into it.
    new_rdd = zipped_rdd.map(lambda args: [args[1] + offset] + list(args[0]))

    # `spark` is assumed to be the active SparkSession, as in the pyspark shell.
    return spark.createDataFrame(new_rdd, new_schema)
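
To see what the map step does without starting Spark, here is a minimal plain-Python sketch. It only mimics the (row, index) pairs that zipWithIndex() produces, with the index starting at 0. The helper names `zip_with_index` and `prepend_index` and the sample rows are made up for illustration and are not part of PySpark or of the answer above.

```python
def zip_with_index(rows):
    """Plain-Python stand-in for RDD.zipWithIndex(): (row, index) pairs, index from 0."""
    return [(row, i) for i, row in enumerate(rows)]


def prepend_index(pair, offset=1):
    """Same transformation as the lambda above: index + offset goes in front of the row values."""
    row, index = pair
    return [index + offset] + list(row)


# Hypothetical sample rows standing in for DataFrame rows.
rows = [("alice", 34), ("bob", 45), ("carol", 29)]

indexed = [prepend_index(pair) for pair in zip_with_index(rows)]
for r in indexed:
    print(r)
# [1, 'alice', 34]
# [2, 'bob', 45]
# [3, 'carol', 29]
```

With the default offset=1 the numbering starts at 1; passing offset=0 keeps the 0-based index that zipWithIndex() returns.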

    
             
                                                        
            
            
              
                