Assuming I have the following RDD:
rdd = sc.parallelize([('a', (5,1)), ('d', (8,2)), ('2', (6,3)), ('a', (8,2)), ('d', (9,6)), ('b', (3,4)), ...])
It is possible but you'll have to include all required information in the composite key:
from pyspark.rdd import portable_hash

n = 2

def partitioner(n):
    """Partition by the first item in the key tuple"""
    def partitioner_(x):
        return portable_hash(x[0]) % n
    return partitioner_

(rdd
    .keyBy(lambda kv: (kv[0], kv[1][0]))  # Create temporary composite key
    .repartitionAndSortWithinPartitions(
        numPartitions=n, partitionFunc=partitioner(n), ascending=False)
    .map(lambda x: x[1]))  # Drop key (note: there is no partitioner set anymore)
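For a quick sanity check you can materialize each partition with glom(). This is only a sketch; it assumes the sample rdd from the question (with the truncated tail omitted) and reuses the n and partitioner defined above:

# Sketch: sample data from the question, minus the truncated tail
rdd = sc.parallelize([('a', (5, 1)), ('d', (8, 2)), ('2', (6, 3)),
                      ('a', (8, 2)), ('d', (9, 6)), ('b', (3, 4))])

result = (rdd
    .keyBy(lambda kv: (kv[0], kv[1][0]))
    .repartitionAndSortWithinPartitions(
        numPartitions=n, partitionFunc=partitioner(n), ascending=False)
    .map(lambda x: x[1]))

# glom() turns each partition into a list so the per-partition order is visible
for i, part in enumerate(result.glom().collect()):
    print(i, part)

Records that share an original key land in the same partition and, within it, appear next to each other sorted by the first element of the value in descending order.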
Explained step-by-step:
keyBy(lambda kv: (kv[0], kv[1][0]))
creates a substitute key which consists of the original key and the first element of the value. In other words, it transforms:
(0, (5,1))
into
((0, 5), (0, (5, 1)))
In practice it can be slightly more efficient to simply reshape the data to
((0, 5), 1)
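To make the two layouts concrete, here is how a single record looks under each approach (plain Python, purely illustrative):

record = (0, (5, 1))

keyed    = ((record[0], record[1][0]), record)         # ((0, 5), (0, (5, 1)))  via keyBy
reshaped = ((record[0], record[1][0]), record[1][1])   # ((0, 5), 1)            via map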
partitioner
defines a partitioning function based on the hash of the first element of the key, so:
partitioner(7)((0, 5))
## 0
partitioner(7)((0, 6))
## 0
partitioner(7)((0, 99))
## 0
partitioner(7)((3, 99))
## 3
As you can see, it is consistent and ignores the second element of the key.
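Hashing only x[0] is the important part; the default partitionFunc (portable_hash applied to the full composite key) could send keys that share an original key to different partitions. A small sketch contrasting the two, reusing the partitioner defined above:

p = partitioner(7)

# Only the first element is hashed, so these always agree
assert p((0, 5)) == p((0, 99))

# Hashing the whole composite key (the default behaviour) may split the group
def whole_key(x):
    return portable_hash(x) % 7

whole_key((0, 5)), whole_key((0, 99))  # not guaranteed to be equal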
repartitionAndSortWithinPartitions
uses the default keyfunc, which is the identity (lambda x: x), and depends on the lexicographic ordering defined on Python tuples:
(0, 5) < (1, 5)
## True
(0, 5) < (0, 4)
## False
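This is the same ordering repartitionAndSortWithinPartitions applies within each partition; with ascending=False it is simply reversed. Sketched here with plain sorted() for illustration:

composite_keys = [(0, 5), (0, 99), (3, 99), (0, 6)]

# ascending=False corresponds to reverse=True on the lexicographic order
sorted(composite_keys, reverse=True)
## [(3, 99), (0, 99), (0, 6), (0, 5)]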
As mentioned before, you could reshape the data instead:
rdd.map(lambda kv: ((kv[0], kv[1][0]), kv[1][1]))
and drop the final map to improve performance.
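Put together, the reshaped pipeline could look roughly like this (a sketch of the variant described above; note that the records then stay in the ((key, v1), v2) shape because there is no final map to restore the original layout):

(rdd
    # Composite key plus the remaining part of the value
    .map(lambda kv: ((kv[0], kv[1][0]), kv[1][1]))
    .repartitionAndSortWithinPartitions(
        numPartitions=n, partitionFunc=partitioner(n), ascending=False))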