Question
I was running some dynamic programming code (trying to brute-force disprove the Collatz conjecture =P) and I was using a dict to store the lengths of the chains I had already computed. Obviously, it ran out of memory at some point. Is there any easy way to use some variant of a dict
which will page parts of itself out to disk when it runs out of room? Obviously it will be slower than an in-memory dict, and it will probably end up eating my hard drive space, but this could apply to other problems that are not so futile.
I realized that a disk-based dictionary is pretty much a database, so I manually implemented one using sqlite3, but I didn't do it in any smart way and had it look up every element in the DB one at a time... it was about 300x slower.
Is the smartest way to just create my own set of dicts, keeping only one in memory at a time, and paging them out in some efficient manner?
Answer 1:
Hash-on-disk is generally addressed with Berkeley DB or something similar; several options are listed in the Python Data Persistence documentation. You can front it with an in-memory cache, but I'd test against native performance first; with operating-system caching in place, it might come out about the same.
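For example, here's a minimal sketch using the standard-library dbm module, which provides exactly this kind of on-disk hash (the filename 'collatz_lengths' is just an illustrative choice):

import dbm

db = dbm.open('collatz_lengths', 'c')  # 'c' opens the file, creating it if needed
db[str(27)] = str(111)                 # dbm stores bytes; str keys/values are encoded automatically
length = int(db[str(27)])              # values come back as bytes; int() accepts bytes
db.close()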
Answer 2:
The third-party shove module is also worth a look. It's very similar to shelve in that it's a simple dict-like object; however, it can store to various backends (such as file, SVN, and S3), provides optional compression, and is even thread-safe. It's a very handy module:
from shove import Shove

mem_store = Shove()                   # in-memory store (the default backend)
file_store = Shove('file://mystore')  # file-backed store at 'mystore'
file_store['key'] = 'some value'      # any picklable value works
Answer 3:
The last time I faced a problem like this, I rewrote my code to use SQLite rather than a dict and saw a massive performance increase. That increase was at least partially due to the database's indexing capabilities; depending on your algorithms, YMMV.
A thin wrapper that does SQLite queries in __getitem__ and __setitem__ isn't much code to write.
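As a rough sketch of what such a wrapper might look like (the DiskDict name and the single-table schema are assumptions for illustration, not anyone's reference implementation):

import sqlite3

class DiskDict:
    """Minimal dict-like wrapper around a single SQLite table."""

    def __init__(self, path):
        self.conn = sqlite3.connect(path)
        # The PRIMARY KEY gives us the index that makes lookups fast.
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS kv (key INTEGER PRIMARY KEY, value INTEGER)')

    def __getitem__(self, key):
        row = self.conn.execute(
            'SELECT value FROM kv WHERE key = ?', (key,)).fetchone()
        if row is None:
            raise KeyError(key)
        return row[0]

    def __setitem__(self, key, value):
        self.conn.execute(
            'INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)', (key, value))

    def __contains__(self, key):
        return self.conn.execute(
            'SELECT 1 FROM kv WHERE key = ?', (key,)).fetchone() is not None

    def close(self):
        self.conn.commit()  # sqlite3 keeps writes in an open transaction until commit
        self.conn.close()

Note that sqlite3 accumulates writes in an open transaction until you commit, so batching many writes per commit rather than committing per item is probably a big part of closing the kind of 300x slowdown the question describes.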
Answer 4:
The shelve module may do it; at any rate, it should be simple to test. Instead of:
self.lengths = {}
do:
import shelve
self.lengths = shelve.open('lengths.shelf')
The only catch is that keys to shelves must be strings, so you'll have to replace self.lengths[indx] with self.lengths[str(indx)].
(I'm assuming your keys are just integers, as per your comment to Charles Duffy's post)
There's no built-in caching in memory, but your operating system may do that for you anyway.
[Actually, that's not quite true: you can pass writeback=True on creation. The intent of this is to make sure that storing lists and other mutable things in the shelf works correctly, but a side effect is that the whole dictionary is cached in memory. Since that's exactly what caused problems for you, it's probably not a good idea :-) ]
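Putting those pieces together, a minimal self-contained version (the filename and sample values are illustrative):

import shelve

lengths = shelve.open('lengths.shelf')  # dict-like object persisted to disk
lengths[str(27)] = 111                  # keys must be str; values can be any picklable object
if str(27) in lengths:
    print(lengths[str(27)])
lengths.close()                         # flushes pending writes and closes the file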
Answer 5:
With a little bit of thought, it seems like you could get the shelve module to do what you want.
Answer 6:
I've read that you think shelve is too slow and that you tried to hack together your own dict using sqlite.
Someone else did this too:
http://sebsauvage.net/python/snyppets/index.html#dbdict
It seems pretty efficient (and sebsauvage is a pretty good coder). Maybe you could give it a try?
Answer 7:
Read GvR's answer to this question ;) Sorting a million 32-bit integers in 2MB of RAM using Python
Answer 8:
You should fetch more than one item at a time if there's some heuristic for which items are most likely to be retrieved next, and don't forget the indexes, as Charles mentions.
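For instance, a sketch building on the kv table from the DiskDict example in answer 3 (the prefetch_range helper and the "nearby keys" heuristic are illustrative assumptions):

def prefetch_range(conn, lo, hi):
    """Pull a contiguous block of keys in one query instead of one query per key."""
    rows = conn.execute(
        'SELECT key, value FROM kv WHERE key BETWEEN ? AND ?', (lo, hi))
    return dict(rows)

# e.g. warm an in-memory cache around the key you're about to need:
# cache.update(prefetch_range(dd.conn, n - 1000, n + 1000))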
Answer 9:
For simple use cases, sqlitedict can help. For much more complex databases, though, you might want to try one of the more upvoted answers.
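A quick sketch, assuming the third-party sqlitedict package is installed (pip install sqlitedict):

from sqlitedict import SqliteDict

lengths = SqliteDict('lengths.sqlite', autocommit=True)  # one SQLite file, dict-like API
lengths[str(27)] = 111   # keys are stored as text; values are pickled, so any object works
print(lengths[str(27)])
lengths.close()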
Source: https://stackoverflow.com/questions/226693/python-disk-based-dictionary