Pickle versus shelve: storing large dictionaries in Python

我在风中等你 2020-12-05 14:05

If I store a large dictionary as a pickle file, does loading it via cPickle mean that the whole thing is read into memory at once?

If s

2 answers
  •  天命终不由人
    2020-12-05 14:35

    If you want a module that's more robust than shelve, you might look at klepto. klepto is built to provide a dictionary interface to platform-agnostic storage on disk or database, and is built to work with large data.

    Here, we first create some pickled objects stored on disk. They use the dir_archive, which stores one object per file.

    >>> d = dict(zip('abcde',range(5)))
    >>> d['f'] = max
    >>> d['g'] = lambda x:x**2
    >>> 
    >>> import klepto
    >>> help(klepto.archives.dir_archive)       
    
    >>> print(klepto.archives.dir_archive.__new__.__doc__)
    initialize a dictionary with a file-folder archive backend
    
        Inputs:
            name: name of the root archive directory [default: memo]
            dict: initial dictionary to seed the archive
            cached: if True, use an in-memory cache interface to the archive
            serialized: if True, pickle file contents; otherwise save python objects
            compression: compression level (0 to 9) [default: 0 (no compression)]
            memmode: access mode for files, one of {None, 'r+', 'r', 'w+', 'c'}
            memsize: approximate size (in MB) of cache for in-memory compression
    
    >>> a = klepto.archives.dir_archive(dict=d)
    >>> a
    dir_archive('memo', {'a': 0, 'c': 2, 'b': 1, 'e': 4, 'd': 3, 'g': <function <lambda> at 0x102f562a8>, 'f': <built-in function max>}, cached=True)
    >>> a.dump()
    >>> del a
    

    Now that the data is all on disk, let's pick and choose which objects to load into memory. b is the dict in memory, while b.archive maps the collection of files into a dictionary view.

    >>> b = klepto.archives.dir_archive('memo')
    >>> b
    dir_archive('memo', {}, cached=True)
    >>> b.keys()   
    []
    >>> b.archive.keys()
    ['a', 'c', 'b', 'e', 'd', 'g', 'f']
    >>> b.load('a')
    >>> b
    dir_archive('memo', {'a': 0}, cached=True)
    >>> b.load('b')
    >>> b.load('f')
    >>> b.load('g')
    >>> b['g'](b['f'](b['a'],b['b']))
    1
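
The "one object per file, load on demand" idea shown above can be sketched with the standard library alone. This is only an illustration of the concept, not how klepto implements dir_archive: the class name OneFilePerKeyStore, the ".pkl" suffix, and the temporary directory are all hypothetical, and it assumes Python 3 and keys that are safe to use as file names. Unlike klepto, plain pickle cannot store lambdas, so only integers are stored here.

```python
import os
import pickle
import tempfile


class OneFilePerKeyStore:
    """Hypothetical sketch: each key is pickled to its own file."""

    def __init__(self, root):
        self.root = root
        self.cache = {}  # objects that have been loaded into memory so far
        os.makedirs(root, exist_ok=True)

    def _path(self, key):
        # Assumes the key is a string that is a valid file name.
        return os.path.join(self.root, key + ".pkl")

    def dump(self, key, value):
        with open(self._path(key), "wb") as f:
            pickle.dump(value, f)

    def archive_keys(self):
        # Keys are discovered from file names; nothing is unpickled here.
        return sorted(n[:-4] for n in os.listdir(self.root) if n.endswith(".pkl"))

    def load(self, key):
        # Only the file for this one key is read and unpickled.
        with open(self._path(key), "rb") as f:
            self.cache[key] = pickle.load(f)


root = tempfile.mkdtemp()
store = OneFilePerKeyStore(root)
for k, v in zip("abcde", range(5)):
    store.dump(k, v)

fresh = OneFilePerKeyStore(root)  # a new, empty view of the same directory
fresh.load("a")
fresh.load("b")
print(fresh.archive_keys())  # ['a', 'b', 'c', 'd', 'e']
print(fresh.cache)           # {'a': 0, 'b': 1}
```

Only the two requested entries end up in memory; the other three stay on disk until they are asked for.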
    

    klepto also provides the same interface to a sql archive.

    >>> print(klepto.archives.sql_archive.__new__.__doc__)
    initialize a dictionary with a sql database archive backend
    
        Connect to an existing database, or initialize a new database, at the
        selected database url. For example, to use a sqlite database 'foo.db'
        in the current directory, database='sqlite:///foo.db'. To use a mysql
        database 'foo' on localhost, database='mysql://user:pass@localhost/foo'.
        For postgresql, use database='postgresql://user:pass@localhost/foo'. 
        When connecting to sqlite, the default database is ':memory:'; otherwise,
        the default database is 'defaultdb'. If sqlalchemy is not installed,
        storable values are limited to strings, integers, floats, and other
        basic objects. If sqlalchemy is installed, additional keyword options
        can provide database configuration, such as connection pooling.
        To use a mysql or postgresql database, sqlalchemy must be installed.
    
        Inputs:
            name: url for the sql database [default: (see note above)]
            dict: initial dictionary to seed the archive
            cached: if True, use an in-memory cache interface to the archive
            serialized: if True, pickle table contents; otherwise cast as strings
    
    >>> c = klepto.archives.sql_archive('database')
    >>> c.update(b)
    >>> c
    sql_archive('sqlite:///database', {'a': 0, 'b': 1, 'g': <function <lambda> at 0x10446b1b8>, 'f': <built-in function max>}, cached=True)
    >>> c.dump()
    

    Now the same objects that are on disk are also in a sql archive. We can add new objects to either archive.

    >>> b['x'] = 69
    >>> c['y'] = 96
    >>> b.dump('x')
    >>> c.dump('y')
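
As a rough analogue of a database-backed dictionary, here is a minimal stdlib sketch that stores pickled values in a sqlite table and fetches them one key at a time. It is an assumption-laden illustration, not klepto's schema or API: the table name memo and the helpers put and get are hypothetical, and an in-memory database is used so the example leaves no files behind.

```python
import pickle
import sqlite3

# In-memory database for the sake of the example; a file path would persist it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE memo (key TEXT PRIMARY KEY, value BLOB)")


def put(key, obj):
    """Hypothetical helper: pickle obj and store it under key."""
    conn.execute(
        "INSERT OR REPLACE INTO memo (key, value) VALUES (?, ?)",
        (key, pickle.dumps(obj)),
    )
    conn.commit()


def get(key):
    """Hypothetical helper: fetch and unpickle only the row for key."""
    row = conn.execute("SELECT value FROM memo WHERE key = ?", (key,)).fetchone()
    if row is None:
        raise KeyError(key)
    return pickle.loads(row[0])


for k, v in zip("abcde", range(5)):
    put(k, v)
put("y", 96)

print(get("y"))  # 96
```

Each lookup reads a single row, so the rest of the stored data never has to be deserialized.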
    

    Get klepto here: https://github.com/uqfoundation
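
For comparison with the standard-library options named in the question, here is a small sketch contrasting pickle and shelve. pickle.load rebuilds the entire dictionary in memory, while a shelf unpickles a value only when its key is accessed. The file names are hypothetical and the code assumes Python 3 (where cPickle has been folded into pickle). Note that the on-disk format of a shelf depends on which dbm backend is available, so shelf files are not guaranteed to be portable between systems.

```python
import os
import pickle
import shelve
import tempfile

data = dict(zip("abcde", range(5)))
workdir = tempfile.mkdtemp()

# pickle: one file holding the whole dictionary.
pickle_path = os.path.join(workdir, "data.pkl")
with open(pickle_path, "wb") as f:
    pickle.dump(data, f)
with open(pickle_path, "rb") as f:
    whole = pickle.load(f)  # every key and value is now in memory

# shelve: a dbm-backed mapping with string keys and pickled values.
shelf_path = os.path.join(workdir, "data.shelf")
with shelve.open(shelf_path) as shelf:
    shelf.update(data)
with shelve.open(shelf_path) as shelf:
    one_value = shelf["c"]             # only this value is unpickled
    shelf_keys = sorted(shelf.keys())  # keys are listed without loading values

print(whole == data)  # True
print(one_value)      # 2
```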
