PySpark - Adding a Column from a list of values using a UDF

前端未结

关注

 5  1500

臣服心动 2021-01-05 00:32

I have to add column to a PySpark dataframe based on a list of values.

a= spark.createDataFrame([(\"Dog\", \"Cat\"), (\"Cat\", \"Dog\"), (\"Mouse\", \"Cat\"


      
      
        
          5条回答        

        
                    
            
            
                         
                
              
              
                
                   小蘑菇
                                             
                
                
                (楼主)
            
              
              
                2021-01-05 01:30
              

            
            
                        
I might be wrong, but I believe the accepted answer will not work. monotonically_increasing_id only guarantees that the ids will be unique and increasing, not that they will be consecutive. Hence using it on two different dataframes will likely create two very different columns, and the join will mostly return empty.

taking inspiration from this answer https://stackoverflow.com/a/48211877/7225303 to a similar question, we could change the incorrect answer to:

from pyspark.sql.window import Window as W
from pyspark.sql import functions as F

a= sqlContext.createDataFrame([("Dog", "Cat"), ("Cat", "Dog"), ("Mouse", "Cat")],
                               ["Animal", "Enemy"])

a.show()

+------+-----+
|Animal|Enemy|
+------+-----+
|   Dog|  Cat|
|   Cat|  Dog|
| Mouse|  Cat|
+------+-----+



#convert list to a dataframe
rating = [5,4,1]
b = sqlContext.createDataFrame([(l,) for l in rating], ['Rating'])
b.show()

+------+
|Rating|
+------+
|     5|
|     4|
|     1|
+------+


a = a.withColumn("idx", F.monotonically_increasing_id())
b = b.withColumn("idx", F.monotonically_increasing_id())

windowSpec = W.orderBy("idx")
a = a.withColumn("idx", F.row_number().over(windowSpec))
b = b.withColumn("idx", F.row_number().over(windowSpec))

a.show()
+------+-----+---+
|Animal|Enemy|idx|
+------+-----+---+
|   Dog|  Cat|  1|
|   Cat|  Dog|  2|
| Mouse|  Cat|  3|
+------+-----+---+

b.show()
+------+---+
|Rating|idx|
+------+---+
|     5|  1|
|     4|  2|
|     1|  3|
+------+---+

final_df = a.join(b, a.idx == b.idx).drop("idx")

+------+-----+------+
|Animal|Enemy|Rating|
+------+-----+------+
|   Dog|  Cat|     5|
|   Cat|  Dog|     4|
| Mouse|  Cat|     1|
+------+-----+------+

    
             
                                                        
            
            
              
                
                0
              
                   
                
               讨论(0)
              
                                                  
              
              
                          
             
       
          
              
                                       
     查看其它5个回答


            
                         
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
                              			
        
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复