I have a service that takes a list of about 1,000,000 dictionaries and does the following:
myHashTable = {}
myLists = { 'hits':{}, 'misses':{}, 'total':{} }
This seems to be pretty fast.
raw = [ {'id':'id1', 'hits':200, 'misses':300, 'total':400},
        {'id':'id2', 'hits':300, 'misses':100, 'total':500},
        {'id':'id3', 'hits':100, 'misses':400, 'total':600} ]

hits = [ (r['hits'], r['id']) for r in raw ]
hits.sort()
misses = [ (r['misses'], r['id']) for r in raw ]
misses.sort()
total = [ (r['total'], r['id']) for r in raw ]
total.sort()
Yes, it makes three passes through the raw data. I think it's faster than pulling out the data in one pass.
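For comparison, the one-pass alternative would look something like this (worth timing before taking my word for it):

hits, misses, total = [], [], []
for r in raw:
    # build all three (value, id) lists in a single pass over raw
    hits.append((r['hits'], r['id']))
    misses.append((r['misses'], r['id']))
    total.append((r['total'], r['id']))
hits.sort()
misses.sort()
total.sort()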
If you have a fixed number of fields, use tuples instead of dictionaries. Place the field you want to sort on in the first position, and just use mylist.sort().
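A quick sketch of that idea (hypothetical data, same fields as above):

# each record is a tuple with the sort field first, instead of a dict
rows = [ (200, 'id1'), (300, 'id2'), (100, 'id3') ]  # (hits, id)
rows.sort()  # plain sort(): tuples compare element by element, so hits first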
Glenn Maynard is correct that a sorted mapping would be appropriate here. Here's one for Python: http://wiki.zope.org/ZODB/guide/node6.html#SECTION000630000000000000000
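If pulling in ZODB feels heavy, the standard-library bisect module can keep a plain list sorted as you insert; a minimal sketch (not the BTree API from the link above, and note each insert is O(n) because the list has to shift):

import bisect

sorted_hits = []                            # kept sorted at all times
bisect.insort(sorted_hits, (200, 'id1'))
bisect.insort(sorted_hits, (100, 'id3'))
print sorted_hits[0]                        # smallest element so far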
I've done some quick profiling of both the original way and S.Lott's proposal. In neither case does it take 5-10 minutes per field; the actual sorting is not the problem. It looks like most of the time is spent slinging data around and transforming it.

Also, my memory usage is skyrocketing: my Python process is using over 350 MB of RAM. Are you sure you're not using up all your RAM and paging to disk? Even with my crappy three-year-old power-saving laptop processor, I am seeing results way less than 5-10 minutes per key sorted for a million items.

What I can't explain is the variability in the actual sort() calls. I know Python's sort is extra good at sorting partially sorted lists, so maybe his list is getting partially sorted in the transform from the raw data to the list to be sorted.
Here's the results for slott's method:
done creating data
done transform. elapsed: 16.5160000324
sorting one key slott's way takes 1.29699993134
Here's the code used to get those results:
import time

# myList holds the million-record test data (creation elided)
starttransform = time.time()
hits = [ (r['hits'], r['id']) for r in myList ]
endtransform = time.time()
print "done transform. elapsed: " + str(endtransform - starttransform)

hits.sort()
endslottsort = time.time()
print "sorting one key slott's way takes " + str(endslottsort - endtransform)
Now the results for the original method, or at least a close version with some instrumentation added:
done creating data
done transform. elapsed: 8.125
about to get stuff to be sorted
done getting data. elapsed time: 37.5939998627
about to sort key hits
done sorting on key <hits> elapsed time: 5.54699993134
Here's the code:
from operator import itemgetter
import time

# myLists maps each field name to a dict of {id: value} (built elsewhere)
mysorted = {}
for k, v in myLists.iteritems():
    time1 = time.time()
    print "about to get stuff to be sorted"
    tobesorted = v.items()
    time2 = time.time()
    print "done getting data. elapsed time: " + str(time2 - time1)
    print "about to sort key " + str(k)
    # list.sort() sorts in place and returns None, so sort first and
    # then store the list (the original assigned the None return value)
    tobesorted.sort(key=itemgetter(1))
    mysorted[k] = tobesorted
    time3 = time.time()
    print "done sorting on key <" + str(k) + "> elapsed time: " + str(time3 - time2)
I would look into using a different sorting algorithm. Something like a merge sort might work: break the list up into smaller lists, sort them individually, and then merge the results. Here's the merge step in Python:
def merge(list1, list2):
    # list1 and list2 have each been sorted separately; recombine them
    result = []
    i, j = 0, 0
    while i < len(list1) and j < len(list2):
        if list1[i] < list2[j]:
            result.append(list1[i])
            i += 1
        else:
            result.append(list2[j])
            j += 1
    # one list is exhausted; append whatever remains of the other
    result.extend(list1[i:])
    result.extend(list2[j:])
    return result
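For what it's worth, the standard library already provides this merge step: heapq.merge (Python 2.6+) lazily merges any number of already-sorted iterables.

import heapq

# list1 and list2 are assumed to be sorted already
merged = list(heapq.merge(list1, list2))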
Instead of trying to keep your list ordered, maybe you can get by with a heap queue. It lets you push any item while keeping the 'smallest' one at h[0], and popping the smallest item (and 'bubbling up' the next smallest) is an O(log n) operation.
So, just ask yourself:

Do you need the whole list ordered all the time? Use an ordered structure (like Zope's BTree package, as mentioned by Ealdwulf).
Do you need the whole list ordered, but only after a day's worth of random insertions? Use sort() the way you're doing, or as in S.Lott's answer.
Do you need just a few of the 'smallest' items at any moment? Use heapq, as in the sketch below.
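A minimal heapq sketch, with hypothetical (value, id) records:

import heapq

h = []
heapq.heappush(h, (300, 'id2'))
heapq.heappush(h, (100, 'id3'))
heapq.heappush(h, (200, 'id1'))

print h[0]                   # (100, 'id3'): the smallest item is always at h[0]
print heapq.heappop(h)       # removes and returns the smallest in O(log n)
print heapq.nsmallest(2, h)  # the two smallest of what's left, without popping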