Spark: Sort records in groups?

忘掉有多难 2020-12-31 11:34

I have a set of records which I need to:

1) Group by 'date', 'city' and 'kind'

2) Sort every group by 'prize'

In my code:

impor         


        
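(The code snippet in the post is cut off after "impor". A minimal, hypothetical reconstruction of the straightforward approach the question describes is shown below; the Record case class, field names, and sample data are assumptions for illustration, not from the original post.)

    import org.apache.spark.{SparkConf, SparkContext}

    // Hypothetical record type; field names taken from the question.
    case class Record(date: String, city: String, kind: String, prize: Double)

    val sc = new SparkContext(new SparkConf().setAppName("group-and-sort").setMaster("local[*]"))

    val records = sc.parallelize(Seq(
      Record("2020-12-31", "Paris",  "A", 3.0),
      Record("2020-12-31", "Paris",  "A", 1.0),
      Record("2020-12-31", "London", "B", 2.0)
    ))

    // Group by (date, city, kind), then sort each group by prize in memory.
    val grouped = records
      .groupBy(r => (r.date, r.city, r.kind))
      .mapValues(_.toSeq.sortBy(_.prize))

    grouped.collect().foreach(println)

This is the groupByKey-style approach that the answer below warns about: every group must fit into a single executor's memory.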
4 Answers
  •  灰色年华
    2020-12-31 12:29

    groupByKey is expensive; it has two implications:

    1. On average, the majority of the data is shuffled across the remaining N-1 partitions.
    2. All records with the same key are loaded into memory on a single executor, potentially causing out-of-memory errors.

    Depending on your use case, you have several better options:

    1. If you don't care about the ordering, use reduceByKey or aggregateByKey.
    2. If you just want to group and sort without any further transformation, prefer repartitionAndSortWithinPartitions (Spark 1.3.0+, http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.OrderedRDDFunctions), but be very careful about which partitioner you specify, and test it, because you are now relying on side effects that may change behaviour in a different environment; see the sketch after this list. See also the examples in this repository: https://github.com/sryza/aas/blob/master/ch08-geotime/src/main/scala/com/cloudera/datascience/geotime/RunGeoTime.scala.
    3. If you are applying either a transformation or a non-reducible aggregation (a fold or a scan) to the iterable of sorted records, then check out this library: spark-sorted (https://github.com/tresata/spark-sorted). It provides three APIs for paired RDDs: mapStreamByKey, foldLeftByKey and scanLeftByKey.

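    For option 2, here is a minimal sketch of the secondary-sort pattern with repartitionAndSortWithinPartitions. The Record case class, field names, custom partitioner and partition count are assumptions for illustration, not part of the original answer:

        import org.apache.spark.{Partitioner, SparkConf, SparkContext}

        // Hypothetical record type; field names taken from the question.
        case class Record(date: String, city: String, kind: String, prize: Double)

        // Partition only on the group part of the composite key so that all
        // records of one (date, city, kind) group land in the same partition.
        class GroupPartitioner(partitions: Int) extends Partitioner {
          override def numPartitions: Int = partitions
          override def getPartition(key: Any): Int = key match {
            case (group, _) => (group.hashCode % numPartitions + numPartitions) % numPartitions
          }
        }

        val sc = new SparkContext(new SparkConf().setAppName("sort-in-groups").setMaster("local[*]"))

        val records = sc.parallelize(Seq(
          Record("2020-12-31", "Paris",  "A", 3.0),
          Record("2020-12-31", "Paris",  "A", 1.0),
          Record("2020-12-31", "London", "B", 2.0)
        ))

        // Composite key: ((date, city, kind), prize). The implicit tuple ordering
        // sorts by group first, then by prize, so within each partition the
        // records of a group come out contiguous and sorted by prize.
        val sorted = records
          .map(r => (((r.date, r.city, r.kind), r.prize), r))
          .repartitionAndSortWithinPartitions(new GroupPartitioner(4))
          .values

        sorted.foreachPartition(_.foreach(println))

    Unlike groupByKey, this never materialises a whole group as one in-memory collection: the sorting happens during the shuffle, and each group is streamed back already ordered by prize.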