Fastest way to store large files in Python

佛祖请我去吃肉 2021-02-04 09:45

I recently asked a question regarding how to save large Python objects to file. I had previously run into problems converting massive Python dictionaries into string and writing …

5 Answers
  •  南旧 (OP)
     2021-02-04 10:33

    I'd just expand on phihag's answer.

    When trying to serialize an object approaching the size of RAM, pickle/cPickle should be avoided, since it requires additional memory of 1-2 times the size of the object just to serialize it. That's true even when streaming it to a BZ2File. In my case I was even running out of swap space.
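
    For context, here is a minimal sketch (with placeholder data and filename) of what streaming a pickle directly into a BZ2File looks like; this is the approach being cautioned against, since per the note above pickling can still need extra memory on the order of the object's own size:

    import bz2
    import pickle
    
    bigdata = {('a', 'b', 'c'): 25, ('d', 'e'): 13}  # placeholder data
    
    # Stream the pickle through a bz2 compressor; the compressed file stays
    # small, but pickling itself can still consume roughly 1-2x extra RAM.
    with bz2.open('bigdata.pkl.bz2', 'wb') as f:
        pickle.dump(bigdata, f, protocol=pickle.HIGHEST_PROTOCOL)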

    But the problem with JSON (and similarly with the HDF files mentioned in the linked article) is that it cannot serialize tuples, which my data uses as dict keys. There is no great solution for this; the best I could find was to convert the tuples to strings, which needs some memory of its own, but much less than pickle. Nowadays, you can also use the ujson library, which is much faster than the built-in json library.
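
    To make the tuple-key limitation concrete, a small illustration with the standard json module (the exact error message varies by Python version):

    import json
    
    try:
        json.dumps({('a', 'b'): 1})
    except TypeError as e:
        # Tuples are not valid JSON object keys
        print(e)  # e.g. "keys must be str, int, float, bool or None, not tuple"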

    For tuples composed of strings (this requires that the strings themselves contain no commas):

    import bz2
    import ujson as json
    
    bigdata = {('a', 'b', 'c'): 25, ('d', 'e'): 13}
    # Join each tuple key into a single comma-separated string
    bigdata = {','.join(k): v for k, v in bigdata.items()}
    
    # Open in text mode ('wt') so json.dump can write str through the compressor
    with bz2.open('filename.json.bz2', 'wt') as f:
        json.dump(bigdata, f)
    

    To re-compose the tuples:

    bigdata = {tuple(k.split(',')): v for k, v in bigdata.items()}
    
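    For completeness, a sketch of the matching read-back (assuming the file written above); the re-composition line above is then applied to the loaded dict:

    import bz2
    import ujson as json
    
    # Text mode ('rt') mirrors the text-mode write used when dumping
    with bz2.open('filename.json.bz2', 'rt') as f:
        bigdata = json.load(f)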

    Alternatively, if e.g. your keys are 2-tuples of integers:

    bigdata2 = {(1, 2): 1.2, (2, 3): 3.4}
    bigdata2 = {'%d,%d' % k: v for k, v in bigdata2.items()}
    # ... save, load ...
    bigdata2 = {tuple(map(int, k.split(','))): v for k, v in bigdata2.items()}
    

    Another advantage of this approach over pickle is that the JSON appears to compress significantly better than a pickle when using bzip2 compression.
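
    If you want to check this on your own data, a rough in-memory comparison is sketched below (it uses the built-in json module and placeholder data; relative sizes will vary with the data and the pickle protocol):

    import bz2
    import json
    import pickle
    
    data = {'%d,%d' % (i, i + 1): float(i) for i in range(10000)}  # placeholder data
    
    # Compare bzip2-compressed sizes of the two serializations
    json_bz2 = bz2.compress(json.dumps(data).encode('utf-8'))
    pickle_bz2 = bz2.compress(pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL))
    print(len(json_bz2), len(pickle_bz2))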
