I have a service that takes a list of about 1,000,000 dictionaries and does the following:
myHashTable = {}
myLists = { 'hits':{}, 'misses':{}, 'total':{} }
This seems to be pretty fast.
raw = [ {'id':'id1', 'hits':200, 'misses':300, 'total':400},
        {'id':'id2', 'hits':300, 'misses':100, 'total':500},
        {'id':'id3', 'hits':100, 'misses':400, 'total':600} ]

hits = [ (r['hits'], r['id']) for r in raw ]
hits.sort()
misses = [ (r['misses'], r['id']) for r in raw ]
misses.sort()
total = [ (r['total'], r['id']) for r in raw ]
total.sort()
Yes, it makes three passes through the raw data. I think it's faster than pulling out the data in one pass.
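For comparison, the one-pass alternative would look something like this (worth timing before taking my word for it):

hits, misses, total = [], [], []
for r in raw:
    # build all three (value, id) lists in a single pass over raw
    hits.append((r['hits'], r['id']))
    misses.append((r['misses'], r['id']))
    total.append((r['total'], r['id']))
hits.sort()
misses.sort()
total.sort()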
If you have a fixed number of fields, use tuples instead of dictionaries. Place the field you want to sort on in the first position, and just use mylist.sort().
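A quick sketch of that idea (hypothetical data, same fields as above):

# each record is a tuple with the sort field first, instead of a dict
rows = [ (200, 'id1'), (300, 'id2'), (100, 'id3') ]  # (hits, id)
rows.sort()  # plain sort(): tuples compare element by element, so hits first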
Glenn Maynard is correct that a sorted mapping would be appropriate here. Here's one for Python: http://wiki.zope.org/ZODB/guide/node6.html#SECTION000630000000000000000
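If pulling in ZODB feels heavy, the standard-library bisect module can keep a plain list sorted as you insert; a minimal sketch (not the BTree API from the link above, and note each insert is O(n) because the list has to shift):

import bisect

sorted_hits = []                            # kept sorted at all times
bisect.insort(sorted_hits, (200, 'id1'))
bisect.insort(sorted_hits, (100, 'id3'))
print sorted_hits[0]                        # smallest element so far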
I've done some quick profiling of both the original way and S.Lott's proposal. In neither case does it take 5-10 minutes per field; the actual sorting is not the problem. It looks like most of the time is spent slinging data around and transforming it.

Also, my memory usage is skyrocketing: my Python process is using over 350 MB of RAM. Are you sure you're not using up all your RAM and paging to disk? Even with my crappy three-year-old power-saving laptop processor, I am seeing results way less than 5-10 minutes per key sorted for a million items.

What I can't explain is the variability in the actual sort() calls. I know Python's sort is extra good at sorting partially sorted lists, so maybe his list is getting partially sorted in the transform from the raw data to the list to be sorted.
Here's the results for slott's method:
done creating data
done transform. elapsed: 16.5160000324
sorting one key slott's way takes 1.29699993134
Here's the code used to get those results:
import time

# myList holds the million-record test data (creation elided)
starttransform = time.time()
hits = [ (r['hits'], r['id']) for r in myList ]
endtransform = time.time()
print "done transform. elapsed: " + str(endtransform - starttransform)

hits.sort()
endslottsort = time.time()
print "sorting one key slott's way takes " + str(endslottsort - endtransform)
Now the results for the original method, or at least a close version with some instrumentation added:
done creating data
done transform. elapsed: 8.125
about to get stuff to be sorted
done getting data. elapsed time: 37.5939998627
about to sort key hits
done sorting on key <hits> elapsed time: 5.54699993134
Here's the code:
from operator import itemgetter
import time

# myLists maps each field name to a dict of {id: value} (built elsewhere)
mysorted = {}
for k, v in myLists.iteritems():
    time1 = time.time()
    print "about to get stuff to be sorted"
    tobesorted = v.items()
    time2 = time.time()
    print "done getting data. elapsed time: " + str(time2 - time1)
    print "about to sort key " + str(k)
    # list.sort() sorts in place and returns None, so sort first and
    # then store the list (the original assigned the None return value)
    tobesorted.sort(key=itemgetter(1))
    mysorted[k] = tobesorted
    time3 = time.time()
    print "done sorting on key <" + str(k) + "> elapsed time: " + str(time3 - time2)
I would look into using a different sorting algorithm. Something like a merge sort might work: break the list up into smaller lists, sort them individually, and then merge the results. Here's the merge step in Python:
def merge(list1, list2):
    # list1 and list2 have each been sorted separately; recombine them
    result = []
    i, j = 0, 0
    while i < len(list1) and j < len(list2):
        if list1[i] < list2[j]:
            result.append(list1[i])
            i += 1
        else:
            result.append(list2[j])
            j += 1
    # one list is exhausted; append whatever remains of the other
    result.extend(list1[i:])
    result.extend(list2[j:])
    return result
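For what it's worth, the standard library already provides this merge step: heapq.merge (Python 2.6+) lazily merges any number of already-sorted iterables.

import heapq

# list1 and list2 are assumed to be sorted already
merged = list(heapq.merge(list1, list2))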
Instead of trying to keep your list ordered, maybe you can get by with a heap queue. It lets you push any item while keeping the 'smallest' one at h[0], and popping the smallest item (and 'bubbling up' the next smallest) is an O(log n) operation.
So, just ask yourself:

Do you need the whole list ordered all the time? Use an ordered structure (like Zope's BTree package, as mentioned by Ealdwulf).
Do you need the whole list ordered, but only after a day's worth of random insertions? Use sort() the way you're doing, or as in S.Lott's answer.
Do you need just a few of the 'smallest' items at any moment? Use heapq, as in the sketch below.
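A minimal heapq sketch, with hypothetical (value, id) records:

import heapq

h = []
heapq.heappush(h, (300, 'id2'))
heapq.heappush(h, (100, 'id3'))
heapq.heappush(h, (200, 'id1'))

print h[0]                   # (100, 'id3'): the smallest item is always at h[0]
print heapq.heappop(h)       # removes and returns the smallest in O(log n)
print heapq.nsmallest(2, h)  # the two smallest of what's left, without popping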