Frequently Updating Stored Data for a Numerical Experiment using Python [closed]

  1. Assuming you are using numpy for your numerical experiments, I would suggest numpy.savez instead of pickle (see the sketch after this list).
  2. Keep it simple, and optimize only if you find that the script runs too long.
  3. Opening and closing files does affect the run time, but having a backup is worth it anyway.
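
For example, here is a minimal numpy.savez sketch; the array shapes, condition names, and the results.npz filename are assumptions for illustration:

import numpy as np

# Hypothetical per-condition results: 1000 trials of 3 values each.
results_a = np.random.rand(1000, 3)
results_b = np.random.rand(1000, 3)

# savez packs named arrays into a single .npz archive;
# np.savez_compressed does the same with compression.
np.savez('results.npz', condition_a=results_a, condition_b=results_b)

# Reloading yields a lazy, dict-like object keyed by the same names.
with np.load('results.npz') as data:
    a = data['condition_a']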

I would also use collections.defaultdict(list) instead of a plain dict with setdefault, as sketched below.
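
A minimal sketch; the key set and loop sizes are placeholders mirroring the klepto example further down:

import random
from collections import defaultdict

data_dict = defaultdict(list)  # a missing key starts as an empty list
for j in 'abc':
    for k in range(1000):
        # no setdefault needed: the first access to data_dict[j] creates the list
        data_dict[j].append([int(10 * random.random()) for i in range(3)])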

Shelve is probably not a good choice, however...

You might try using klepto or joblib. Both are good at caching results, and can use efficient storage formats.

Both joblib and klepto can save your results to a file or a directory on disk. Both can also leverage numpy's internal storage format and/or compression on save, and can work with memory-mapped files if you like (see the sketch below).
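
As a rough sketch of the joblib side (the filenames and data are assumptions; compress and mmap_mode are joblib's own options):

import numpy as np
import joblib

results = {'condition_a': np.random.rand(1000, 3)}  # placeholder data

# compress trades CPU time for a smaller file on disk.
joblib.dump(results, 'results.joblib', compress=3)

# For uncompressed saves, large numpy arrays inside the object can be
# memory-mapped on load instead of being read fully into memory.
joblib.dump(results, 'results_raw.joblib')
loaded = joblib.load('results_raw.joblib', mmap_mode='r')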

If you use klepto, it uses each dictionary key as a filename and saves the corresponding value as that file's contents. With klepto, you can also pick whether you want to use pickle, json, or some other storage format.

Python 2.7.7 (default, Jun  2 2014, 01:33:50) 
[GCC 4.2.1 Compatible Apple Clang 4.1 ((tags/Apple/clang-421.11.66))] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import klepto
>>> data_dict = klepto.archives.dir_archive('storage', cached=False, serialized=True)     
>>> import string
>>> import random
>>> for j in string.ascii_letters:
...   for k in range(1000):
...     data_dict.setdefault(j, []).append([int(10*random.random()) for i in range(3)])
... 
>>> 

This will create a directory called storage that contains one pickled file for each key of your data_dict. There are keywords for using memmap files, and also for setting the compression level. If you had chosen cached=True instead, then rather than dumping to disk each time you write to data_dict, you would write to memory each time, and could then call data_dict.dump() to flush to disk whenever you choose, or set a memory limit so that when you hit it, the data is dumped to disk. Additionally, you can pick a caching strategy (such as lru or lfu) to decide which keys get purged from memory and dumped to disk.
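
Here is a minimal sketch of that cached workflow, under the assumption that dump() with no arguments flushes all in-memory keys to the archive:

import klepto

# cached=True: assignments stay in memory until explicitly dumped.
demo = klepto.archives.dir_archive('storage', cached=True, serialized=True)
demo['x'] = [1, 2, 3]  # held in memory only at this point
demo.dump()            # flush to the 'storage' directory on disk
demo.load('x')         # pull a single key back from the archive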

Get klepto here: https://github.com/uqfoundation

or get joblib here: https://github.com/joblib/joblib

If you refactor, you could probably find a way to take advantage of a pre-allocated array, as in the sketch below. Whether that pays off depends on the runtime profile of your code.
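
A minimal sketch of that idea, assuming you know the number of trials up front (the sizes and filename below are placeholders):

import numpy as np

n_trials = 1000
results = np.empty((n_trials, 3))  # allocate once, up front

for k in range(n_trials):
    # write each trial's three values into its preallocated row
    results[k] = np.random.randint(0, 10, size=3)

np.save('results.npy', results)  # one contiguous array, cheap to save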

Does opening and closing files affect run time? Yes. If you use klepto, you can set the granularity of when you dump to disk, and then pick your own trade-off between speed and the safety of having intermediate results stored on disk.
