*large* python dictionary with persistent storage for quick look-ups

心在旅途 2020-12-08 21:05

I have 400 million lines of unique key-value data that I would like to have available for quick look-ups in a script. I am wondering what would be a slick way of doing this.

6 Answers
  • 2020-12-08 21:26

    No one has mentioned dbm. It is opened like a file, behaves like a dictionary and is in the standard distribution.

    From the docs https://docs.python.org/3/library/dbm.html

    import dbm
    
    # Open database, creating it if necessary.
    with dbm.open('cache', 'c') as db:
    
        # Record some values
        db[b'hello'] = b'there'
        db['www.python.org'] = 'Python Website'
        db['www.cnn.com'] = 'Cable News Network'
    
        # Note that the keys are considered bytes now.
        assert db[b'www.python.org'] == b'Python Website'
        # Notice how the value is now in bytes.
        assert db['www.cnn.com'] == b'Cable News Network'
    
        # Often-used methods of the dict interface work too.
        print(db.get('python.org', b'not present'))
    
        # Storing a non-string key or value will raise an exception (most
        # likely a TypeError); guard it so the example runs to completion.
        try:
            db['www.yahoo.com'] = 4
        except TypeError:
            pass
    
    # db is automatically closed when leaving the with statement.
    

    I would try this before any of the more exotic options; by contrast, a plain pickled dict gets pulled entirely into memory when it is loaded.

    Cheers

    Tim

  • 2020-12-08 21:28

    If you want to persist a large dictionary, you are basically looking at a database.

    Python comes with built-in support for sqlite3, which gives you an easy database solution backed by a file on disk.
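
    A minimal sketch of that approach (the file name and table layout are my own placeholders, not something from the answer):

    import sqlite3

    # One file on disk holds the whole key-value table.
    conn = sqlite3.connect("kv.db")
    conn.execute("CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value TEXT)")

    # Bulk insert inside a single transaction keeps loading fast.
    with conn:
        conn.executemany(
            "INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)",
            [("key1", "value1"), ("key2", "value2")],
        )

    # Look-ups go through the primary-key index, so they stay quick
    # even with hundreds of millions of rows.
    row = conn.execute("SELECT value FROM kv WHERE key = ?", ("key1",)).fetchone()
    print(row[0] if row else "not present")
    conn.close()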

  • 2020-12-08 21:32

    Without a doubt (in my opinion), if you want this to persist, then Redis is a great option.

    1. Install redis-server
    2. Start the Redis server
    3. Install the redis Python package (pip install redis)
    4. Profit.

    import redis
    
    ds = redis.Redis(host="localhost", port=6379)
    
    with open("your_text_file.txt") as fh:
        for line in fh:
            line = line.strip()
            k, _, v = line.partition("=")
            ds.set(k, v)
    

    The above assumes a file of key=value lines like:

    key1=value1
    key2=value2
    etc=etc
    

    Modify the insertion script to your needs.


    import redis
    ds = redis.Redis(host="localhost", port=6379)
    
    # In the code that needs to look up keys:
    for mykey in special_key_list:
        val = ds.get(mykey)
    
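    With 400 million keys, one network round trip per SET adds up, so batching the inserts through a redis-py pipeline is a lot faster. A rough sketch, reusing the same assumed key=value file as above (the batch size of 10,000 is arbitrary):

    import redis

    ds = redis.Redis(host="localhost", port=6379)

    with open("your_text_file.txt") as fh:
        pipe = ds.pipeline()
        for i, line in enumerate(fh, 1):
            k, _, v = line.strip().partition("=")
            pipe.set(k, v)
            # Send the buffered commands to the server in batches.
            if i % 10000 == 0:
                pipe.execute()
        pipe.execute()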

    Why I like Redis:

    1. Configurable persistence options
    2. Blazingly fast
    3. Offers more than just key/value pairs (other data types)
    4. @antirez
  • 2020-12-08 21:34

    I don't think you should try a pickled dict. I'm pretty sure Python will slurp the whole thing into memory every time it loads, which means your program will spend more time waiting on I/O than necessary.

    This is the sort of problem for which databases were invented. You are thinking "NoSQL", but an SQL database would also work. You should be able to use SQLite for this; I've never made an SQLite database that large, but according to this discussion of SQLite limits, 400 million entries should be okay.

    What are the performance characteristics of sqlite with very large database files?

  • 2020-12-08 21:39

    In principle the shelve module does exactly what you want. It provides a persistent dictionary backed by a database file. Keys must be strings, but shelve will take care of pickling/unpickling values. The type of db file can vary, but it can be a Berkeley DB hash, which is an excellent lightweight key-value database.

    Your data size sounds huge, so you should do some testing, but shelve/BDB is probably up to it.

    Note: the bsddb module has been deprecated and removed from the standard library in Python 3, so shelve may not support BDB hashes in the future.
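
    A minimal sketch of the shelve approach (the file name is my own placeholder):

    import shelve

    # Build the persistent dictionary; values are pickled/unpickled for you.
    with shelve.open("lookup_shelf") as db:
        db["key1"] = {"any": "picklable value"}

    # In the script that does the look-ups, only the requested entries
    # are read back from disk.
    with shelve.open("lookup_shelf", flag="r") as db:
        print(db.get("key1", "not present"))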

  • 2020-12-08 21:42

    I personally use LMDB and its Python binding for a database of a few million records. It is extremely fast, even for a database larger than RAM. It's embedded in the process, so no server is needed. Dependencies are managed with pip.

    The only downside is that you have to specify the maximum size of the DB up front: LMDB mmaps a file of this size. If it is too small, inserting new data will raise an error; too large, and you create a sparse file.
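
    A minimal sketch with the lmdb binding (the path and map_size are my own placeholders; pick a map_size comfortably above the size you expect the data to reach):

    import lmdb

    # map_size is the cap LMDB will mmap; the file only fills as data is written.
    env = lmdb.open("my_lmdb_dir", map_size=1024 ** 4)  # up to 1 TiB

    # Keys and values are bytes.
    with env.begin(write=True) as txn:
        txn.put(b"key1", b"value1")

    with env.begin() as txn:
        print(txn.get(b"key1", default=b"not present"))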
