Fastest way to store large files in Python

佛祖请我去吃肉 2021-02-04 09:45

I recently asked a question regarding how to save large Python objects to file. I had previously run into problems converting massive Python dictionaries into string and writing …

5 Answers
  •  南旧 (OP)
     2021-02-04 10:33

    I'd just expand on phihag's answer.

    When trying to serialize an object approaching the size of RAM, pickle/cPickle should be avoided, since it requires additional memory of 1-2 times the size of the object just to serialize it. That's true even when streaming it to a BZ2File. In my case I was even running out of swap space.
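
    For context, here is a minimal sketch (with placeholder data and filename) of what streaming a pickle directly into a BZ2File looks like; this is the approach being cautioned against, since per the note above pickling can still need extra memory on the order of the object's own size:

    import bz2
    import pickle
    
    bigdata = {('a', 'b', 'c'): 25, ('d', 'e'): 13}  # placeholder data
    
    # Stream the pickle through a bz2 compressor; the compressed file stays
    # small, but pickling itself can still consume roughly 1-2x extra RAM.
    with bz2.open('bigdata.pkl.bz2', 'wb') as f:
        pickle.dump(bigdata, f, protocol=pickle.HIGHEST_PROTOCOL)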

    But the problem with JSON (and similarly with the HDF files mentioned in the linked article) is that it cannot serialize tuples, which my data uses as dict keys. There is no great solution for this; the best I could find was to convert the tuples to strings, which needs some memory of its own, but much less than pickle. Nowadays, you can also use the ujson library, which is much faster than the built-in json library.
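
    To make the tuple-key limitation concrete, a small illustration with the standard json module (the exact error message varies by Python version):

    import json
    
    try:
        json.dumps({('a', 'b'): 1})
    except TypeError as e:
        # Tuples are not valid JSON object keys
        print(e)  # e.g. "keys must be str, int, float, bool or None, not tuple"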

    For tuples composed of strings (this requires that the strings themselves contain no commas):

    import bz2
    import ujson as json
    
    bigdata = {('a', 'b', 'c'): 25, ('d', 'e'): 13}
    # Join each tuple key into a single comma-separated string
    bigdata = {','.join(k): v for k, v in bigdata.items()}
    
    # Open in text mode ('wt') so json.dump can write str through the compressor
    with bz2.open('filename.json.bz2', 'wt') as f:
        json.dump(bigdata, f)
    

    To re-compose the tuples:

    bigdata = {tuple(k.split(',')): v for k, v in bigdata.items()}
    
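    For completeness, a sketch of the matching read-back (assuming the file written above); the re-composition line above is then applied to the loaded dict:

    import bz2
    import ujson as json
    
    # Text mode ('rt') mirrors the text-mode write used when dumping
    with bz2.open('filename.json.bz2', 'rt') as f:
        bigdata = json.load(f)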

    Alternatively, if e.g. your keys are 2-tuples of integers:

    bigdata2 = {(1, 2): 1.2, (2, 3): 3.4}
    bigdata2 = {'%d,%d' % k: v for k, v in bigdata2.items()}
    # ... save, load ...
    bigdata2 = {tuple(map(int, k.split(','))): v for k, v in bigdata2.items()}
    

    Another advantage of this approach over pickle is that the JSON appears to compress significantly better than a pickle when using bzip2 compression.
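
    If you want to check this on your own data, a rough in-memory comparison is sketched below (it uses the built-in json module and placeholder data; relative sizes will vary with the data and the pickle protocol):

    import bz2
    import json
    import pickle
    
    data = {'%d,%d' % (i, i + 1): float(i) for i in range(10000)}  # placeholder data
    
    # Compare bzip2-compressed sizes of the two serializations
    json_bz2 = bz2.compress(json.dumps(data).encode('utf-8'))
    pickle_bz2 = bz2.compress(pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL))
    print(len(json_bz2), len(pickle_bz2))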
