spark dataframe drop duplicates and keep first

后端未结

关注

 5  714

孤街浪徒 2020-12-05 02:47

Question: in pandas when dropping duplicates you can specify which columns to keep. Is there an equivalent in Spark Dataframes?

Pandas:

df.sort_value


      
      
        
          5条回答        

        
                    
            
            
                         
                
              
              
                
                   臣服心动
                                             
                
                
                (楼主)
            
              
              
                2020-12-05 03:01
              

            
            
                        
solution 1
add a new column row num(incremental column) and drop duplicates based the min row after grouping on all the columns you are interested in.(you can include all the columns for dropping duplicates except the row num col)

solution 2:
turn the data-frame into a rdd (df.rdd) then group the rdd on one or more or all keys and then run a lambda function on the group and drop the rows the way you want and return only the row that you are interested in.

One of my friend (sameer) mentioned that below(old solution) didn't work for him.
use dropDuplicates method by default it keeps the first occurance.
    
             
                                                        
            
            
              
                
                0
              
                   
                
               讨论(0)
              
                                                  
              
              
                          
             
       
          
              
                                       
     查看其它5个回答


            
                         
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
                              			
        
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复