Python Disk-Based Dictionary

Submitted by 只愿长相守 on 2019-11-28 04:06:54
Parand

Hash-on-disk is generally addressed with Berkeley DB or something similar - several options are listed in the Python Data Persistence documentation. You can front it with an in-memory cache, but I'd test against native performance first; with operating system caching in place it might come out about the same.
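As a minimal sketch of the hash-on-disk approach, the standard-library dbm module opens whatever Berkeley-DB-style backend is available (GDBM, ndbm, or a pure-Python fallback). The file name here is illustrative; keys and values must be str or bytes:

```python
import dbm
import os
import tempfile

# Illustrative path; use any writable location
path = os.path.join(tempfile.mkdtemp(), "lengths.db")

with dbm.open(path, "c") as db:   # "c" = create the file if needed
    db["42"] = "12345"            # written to disk, not held in memory
    print(db["42"])               # values come back as bytes: b'12345'
```

Note that dbm returns values as bytes, so you would decode (or pickle) anything more structured yourself.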

The third-party shove module is also worth a look. It's very similar to shelve in that it is a simple dict-like object; however, it can store to various backends (such as file, SVN, and S3), provides optional compression, and is even thread-safe. It's a very handy module.

from shove import Shove

mem_store = Shove()                    # in-memory store
file_store = Shove('file://mystore')   # file-backed store

file_store['key'] = 'value'

Last time I faced a problem like this, I rewrote the code to use SQLite rather than a dict and saw a massive performance increase. That increase was at least partly due to the database's indexing capabilities; depending on your algorithms, YMMV.

A thin wrapper that does SQLite queries in __getitem__ and __setitem__ isn't much code to write.
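A minimal sketch of such a wrapper, using the standard-library sqlite3 module (the table name and schema are my own choices, not anything from the original post):

```python
import sqlite3

class SQLiteDict:
    """Dict-like wrapper that stores key/value pairs in a SQLite table."""

    def __init__(self, path):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value TEXT)"
        )

    def __setitem__(self, key, value):
        self.conn.execute(
            "INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)",
            (str(key), str(value)),
        )
        self.conn.commit()

    def __getitem__(self, key):
        row = self.conn.execute(
            "SELECT value FROM kv WHERE key = ?", (str(key),)
        ).fetchone()
        if row is None:
            raise KeyError(key)
        return row[0]

d = SQLiteDict(":memory:")   # pass a file path instead for on-disk storage
d[42] = 12345
print(d[42])                 # '12345' (everything is stored as text here)
```

Committing on every write keeps the example simple but is slow; batching writes in a transaction is the usual first optimization.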

The shelve module may do it; at any rate, it should be simple to test. Instead of:

self.lengths = {}

do:

import shelve
self.lengths = shelve.open('lengths.shelf')

The only catch is that keys to shelves must be strings, so you'll have to replace

self.lengths[indx]

with

self.lengths[str(indx)]

(I'm assuming your keys are just integers, as per your comment to Charles Duffy's post)

There's no built-in caching in memory, but your operating system may do that for you anyway.

[Actually, that's not quite true: you can pass writeback=True when opening the shelf. The intent is to make storing lists and other mutable objects in the shelf work correctly, but a side effect is that every entry you access is cached in memory until you sync or close the shelf. Since memory use was your problem in the first place, it's probably not a good idea :-)]

With a little bit of thought it seems like you could get the shelve module to do what you want.
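Putting the pieces above together, a runnable sketch of the shelve approach with stringified integer keys (the file name is illustrative):

```python
import os
import shelve
import tempfile

# Illustrative path; any writable location works
path = os.path.join(tempfile.mkdtemp(), "lengths.shelf")

lengths = shelve.open(path)   # behaves like a dict, backed by disk
indx = 42
lengths[str(indx)] = 12345    # keys must be strings; values can be
                              # any picklable object
print(lengths[str(indx)])     # 12345
lengths.close()               # flush everything to disk
```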

You've said that shelve is too slow and that you tried to hack your own dict using sqlite.

Someone else did this too:

http://sebsauvage.net/python/snyppets/index.html#dbdict

It seems pretty efficient (and sebsauvage is a pretty good coder). Maybe you could give it a try?

Read GvR's answer to this question ;) Sorting a million 32-bit integers in 2MB of RAM using Python

You should fetch more than one item at a time if there's some heuristic for knowing which items are most likely to be retrieved next, and don't forget the indexes, as Charles mentions.

I haven't tried it yet, but Hamster DB is promising and has a Python interface.
