Best way to sort 1M records in Python

慢半拍i 2020-12-05 08:41

I have a service that takes a list of about 1,000,000 dictionaries and does the following:

myHashTable = {}
myLists = { 'hits':{}, 'misses':{},


        
11 Answers
  • 2020-12-05 09:33

    Honestly, the best way is to not use Python. If performance is a major concern for this, use a faster language.

  • 2020-12-05 09:38

    You may find this related answer from Guido: Sorting a million 32-bit integers in 2MB of RAM using Python

  • 2020-12-05 09:39

    Others have provided some excellent advice; try it out.

    As general advice, in situations like this you need to profile your code. Know exactly where most of the time is spent. Bottlenecks hide well, in the places you least expect them to be.
    If there is a lot of number crunching involved, then a JIT compiler like the (now-dead) psyco might also help. When processing takes minutes or hours, a 2x speed-up really counts.

    • http://docs.python.org/library/profile.html
    • http://www.vrplumber.com/programming/runsnakerun/
    • http://psyco.sourceforge.net/
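As a rough illustration of the profiling advice above, here is a minimal sketch using the standard library's `cProfile` and `pstats`. The `build_records` and `process` functions and the record shape are hypothetical stand-ins, since the original workload is not shown in full.

```python
import cProfile
import io
import pstats
import random


def build_records(n):
    """Create n hypothetical records shaped like small dictionaries."""
    rng = random.Random(0)
    return [{"id": i, "score": rng.random()} for i in range(n)]


def process(records):
    """Stand-in for the real workload: sort records by score, descending."""
    return sorted(records, key=lambda r: r["score"], reverse=True)


def profile_call(func, *args):
    """Run func under cProfile and return (result, report text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Show the five entries with the largest cumulative time.
    stats.sort_stats("cumulative").print_stats(5)
    return result, stream.getvalue()


records = build_records(10_000)
ordered, report = profile_call(process, records)
print(report)
```

Reading the report from the top usually shows which call dominates the run time, and that is the place worth optimizing first.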
  • 2020-12-05 09:40

    What you really want is an ordered container, instead of an unordered one. That would implicitly sort the results as they're inserted. The standard data structure for this is a tree.

    However, there doesn't seem to be one of these in Python. I can't explain that; this is a core, fundamental data type in any language. Python's dict and set are both unordered containers, which map to the basic data structure of a hash table. It should definitely have an optimized tree data structure; there are many things you can do with them that are impossible with a hash table, and they're quite tricky to implement well, so people generally don't want to be doing it themselves.

    (There's also nothing mapping to a linked list, which also should be a core data type. No, a deque is not equivalent.)

    I don't have an existing ordered container implementation to point you to (and it should probably be implemented natively, not in Python), but hopefully this will point you in the right direction.

    A good tree implementation should support iterating across a range by value ("iterate all values from [2,100] in order"), find next/prev value from any other node in O(1), efficient range extraction ("delete all values in [2,100] and return them in a new tree"), etc. If anyone has a well-optimized data structure like this for Python, I'd love to know about it. (Not all operations fit nicely in Python's data model; for example, to get next/prev value from another value, you need a reference to a node, not the value itself.)
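For what it is worth, some of the operations described above can be approximated with the standard library's `bisect` module on a plain list. This is only a sketch and not a tree: lookups are O(log n), but each insert shifts list elements and is O(n). The `SortedValues` class and its method names are hypothetical, made up for this illustration.

```python
import bisect


class SortedValues:
    """A small sorted container backed by a list (not a tree)."""

    def __init__(self, values=()):
        self._items = sorted(values)

    def add(self, value):
        # Binary search finds the slot in O(log n); the list insert is O(n).
        bisect.insort(self._items, value)

    def irange(self, low, high):
        """Return an iterator over all values v with low <= v <= high."""
        start = bisect.bisect_left(self._items, low)
        stop = bisect.bisect_right(self._items, high)
        return iter(self._items[start:stop])

    def pop_range(self, low, high):
        """Remove all values in [low, high] and return them in a new container."""
        start = bisect.bisect_left(self._items, low)
        stop = bisect.bisect_right(self._items, high)
        removed = self._items[start:stop]
        del self._items[start:stop]
        result = SortedValues()
        result._items = removed  # already sorted, so no re-sort is needed
        return result

    def __iter__(self):
        return iter(self._items)

    def __len__(self):
        return len(self._items)


values = SortedValues([150, 7, 42, 1, 99])
values.add(2)
values.add(100)
in_range = list(values.irange(2, 100))
extracted = values.pop_range(2, 100)
remaining = list(values)
```

A list-backed container like this tends to be adequate when reads greatly outnumber inserts; with heavy insertion into a large collection, the O(n) shifting cost is exactly what a real tree would avoid.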

  • 2020-12-05 09:42
    sorted(myLists[key], key=myLists[key].get, reverse=True)
    

    should save you some time, though not a lot.
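A small sketch of that call in context, assuming each bucket in `myLists` maps a key to a numeric count. The sample values and the `keys_by_value` helper are made up, since the original data is not shown.

```python
# Hypothetical sample data: each bucket maps a key to a count.
myLists = {
    "hits": {"a": 3, "b": 10, "c": 1},
    "misses": {"x": 5, "y": 2},
}


def keys_by_value(bucket):
    """Return the bucket's keys ordered by their values, largest first."""
    # Passing the bound method bucket.get as the key function avoids
    # calling a Python-level lambda for every element.
    return sorted(bucket, key=bucket.get, reverse=True)


ranked = {name: keys_by_value(bucket) for name, bucket in myLists.items()}
print(ranked)
```

The saving comes from skipping one Python function call per element, so it is a constant-factor gain and the sort itself is still O(n log n).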
