Key: value store in Python for possibly 100 GB of data, without client/server

猫巷女王i 2020-12-25 14:26

There are many solutions to serialize a small dictionary: json.loads/json.dumps, pickle, shelve, ujson, or e

6 Answers
  •  再見小時候
    2020-12-25 14:58

    First, bsddb (or, under its new name, Oracle BerkeleyDB) is not deprecated.

    From experience, LevelDB / RocksDB / bsddb are slower than wiredtiger, which is why I recommend wiredtiger.

    wiredtiger is the storage engine for MongoDB, so it is well tested in production. There is little or no use of wiredtiger in Python outside my AjguDB project; I use wiredtiger (via AjguDB) to store and query wikidata and concept, which come to around 80GB.

    Here is an example class that mimics the Python 2 shelve module. Basically, it's a wiredtiger-backed dictionary where keys can only be strings:

    import json
    
    from wiredtiger import wiredtiger_open
    
    
    WT_NOT_FOUND = -31803
    
    
    class WTDict:
        """Create a wiredtiger backed dictionary"""
    
        def __init__(self, path, config='create'):
            self._cnx = wiredtiger_open(path, config)
            self._session = self._cnx.open_session()
            # define key value table
            self._session.create('table:keyvalue', 'key_format=S,value_format=S')
            self._keyvalue = self._session.open_cursor('table:keyvalue')
    
        def __enter__(self):
            return self
    
        def close(self):
            self._cnx.close()
    
        def __exit__(self, *args, **kwargs):
            self.close()
    
        def _loads(self, value):
            return json.loads(value)
    
        def _dumps(self, value):
            return json.dumps(value)
    
        def __getitem__(self, key):
            self._session.begin_transaction()
            self._keyvalue.set_key(key)
            if self._keyvalue.search() == WT_NOT_FOUND:
                # close the transaction before reporting the missing key
                self._session.rollback_transaction()
                raise KeyError(key)
            out = self._loads(self._keyvalue.get_value())
            self._session.commit_transaction()
            return out
    
        def __setitem__(self, key, value):
            self._session.begin_transaction()
            self._keyvalue.set_key(key)
            self._keyvalue.set_value(self._dumps(value))
            self._keyvalue.insert()
            self._session.commit_transaction()
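
Since `_dumps` and `_loads` delegate to `json`, a value read back from `WTDict` is the JSON round trip of what was stored, which is not always an equal object. A minimal sketch of that behaviour using only the standard library (the `roundtrip` helper is hypothetical and stands in for a store followed by a lookup):

```python
import json


def roundtrip(value):
    """Mimic WTDict._dumps followed by WTDict._loads (hypothetical helper)."""
    return json.loads(json.dumps(value))


# Plain JSON-compatible structures survive unchanged.
print(roundtrip({"name": "wiredtiger", "tags": ["kv", "btree"]}))

# Tuples come back as lists.
print(roundtrip((1, 2, 3)))

# Non-string keys of a nested dict come back as strings.
print(roundtrip({1: "one"}))

# Types that JSON cannot represent, such as sets, are rejected on write.
try:
    json.dumps({1, 2, 3})
except TypeError as exc:
    print("TypeError:", exc)
```

If values need to keep their exact Python types, swapping `json` for another serializer in `_dumps` and `_loads` would be the place to do it; that is a design choice outside what the class above shows.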
    

    Here is the test program, adapted from @saaj's answer:

    #!/usr/bin/env python3
    
    import os
    import random
    
    import lipsum
    from wtdict import WTDict
    
    
    def main():
        with WTDict('wt') as wt:
            for _ in range(100000):
                v = lipsum.generate_paragraphs(2)[0:random.randint(200, 1000)]
                # key_format=S expects string keys, so hex-encode the random bytes
                wt[os.urandom(10).hex()] = v
    
    if __name__ == '__main__':
        main()
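
The test program depends on the third-party `lipsum` package. As a rough stand-in for reproducing a workload of the same shape with only the standard library, something like the following could be used. The `random_records` generator and its parameters are assumptions for illustration, and random letters will not compress the way lorem ipsum text does, so on-disk sizes with compression enabled are not comparable.

```python
import os
import random
import string


def random_records(n, min_len=200, max_len=1000, seed=None):
    """Yield n (key, value) pairs: hex string keys, random-length text values."""
    rng = random.Random(seed)
    alphabet = string.ascii_lowercase + " "
    for _ in range(n):
        key = os.urandom(10).hex()              # 20-character string key
        length = rng.randint(min_len, max_len)  # bounds are inclusive
        value = "".join(rng.choices(alphabet, k=length))
        yield key, value


if __name__ == "__main__":
    # With the WTDict class from above, the loop body would be: wt[key] = value
    for key, value in random_records(3, seed=0):
        print(key, len(value))
```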
    

    Using the following command line:

    python test-wtdict.py & psrecord --plot=plot.png --interval=0.1 $!
    

    I generated the following diagram (the plot image is not reproduced here).

    Disk usage of the resulting database:

    $ du -h wt
    60M wt
    

    With the write-ahead log enabled:

    $ du -h wt
    260M    wt
    

    This is without performance tuning or compression.
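
To put the two `du` figures above in context, here is a back-of-the-envelope estimate of the raw payload written by the test run. It assumes ASCII text (one byte per character), that `lipsum` always returns at least 1000 characters so the slice length is uniform between 200 and 1000, and it ignores JSON quoting and per-record storage overhead, so treat the ratios as approximate.

```python
# Rough payload estimate for the test run: 100,000 records,
# 10-byte keys, values of 200 to 1000 characters.
N_RECORDS = 100_000
KEY_BYTES = 10                 # os.urandom(10)
MIN_LEN, MAX_LEN = 200, 1000   # random.randint bounds, inclusive

mean_value_bytes = (MIN_LEN + MAX_LEN) / 2
payload_bytes = N_RECORDS * (KEY_BYTES + mean_value_bytes)
payload_mib = payload_bytes / 2**20


def overhead(du_mib):
    """Ratio of reported on-disk size (MiB) to estimated raw payload."""
    return du_mib / payload_mib


print(f"expected payload:          {payload_mib:.1f} MiB")
print(f"on disk, without the log:  {overhead(60):.2f}x payload")
print(f"on disk, with the log:     {overhead(260):.2f}x payload")
```

Under these assumptions the table without the log is close to the raw payload size, and the log accounts for most of the extra space in the second measurement.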

    WiredTiger has no known size limit; the documentation was recently updated to state the following:

    WiredTiger supports petabyte tables, records up to 4GB, and record numbers up to 64-bits.

    http://source.wiredtiger.com/1.6.4/architecture.html
