Question
I am running a numerical experiment that requires many iterations. After each iteration, I would like to store the data in a pickle file (or something pickle-like) in case the program times out or a data structure becomes tapped out. What is the best way to proceed? Here is the skeleton code:
import pickle

data_dict = {}  # maybe a dictionary is not the best choice
for j in parameters:  # j = (alpha, beta, gamma); cycle through parameter tuples
    for k in range(number_of_experiments):  # lots of experiments (10^4)
        file = open('storage.pkl', 'ab')
        data = experiment()  # experiment returns some numerical value
                             # experiment takes ~1 second, but increases
                             # as the parameters scale
        data_dict.setdefault(j, []).append(data)
        pickle.dump(data_dict, file)
        file.close()
Questions:
- Is shelve a better choice here? Or some other Python library that I am not aware of?
- I am using a dict because it's easier to code and more flexible if I need to change things as I do more experiments. Would it be a huge advantage to use a pre-allocated array?
- Does opening and closing files affect run time? I do this so that I can check on the progress in addition to the text logs I have set up.
Thank you for all your help!
Answer 1:
- Assuming you are using numpy for your numerical experiments, instead of pickle I would suggest using numpy.savez.
- Keep it simple and make optimizations only if you feel that the script runs too long.
- Opening and closing files does affect the run time, but having a backup is better anyway.
And I would use collections.defaultdict(list) instead of a plain dict with setdefault.
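For example, a minimal sketch combining collections.defaultdict with numpy.savez_compressed, assuming each experiment returns a single float; parameters, number_of_experiments, and experiment are the placeholders from the question:

import numpy as np
from collections import defaultdict

results = defaultdict(list)            # parameter tuple -> list of experiment results
for j in parameters:                   # placeholders from the question
    for k in range(number_of_experiments):
        results[j].append(experiment())
    # checkpoint after each parameter tuple: one named array per parameter set
    arrays = {'run_%d' % i: np.asarray(v) for i, v in enumerate(results.values())}
    arrays['keys'] = np.asarray(list(results.keys()))   # (alpha, beta, gamma) tuples as rows
    np.savez_compressed('storage.npz', **arrays)

On restart, np.load('storage.npz') returns the saved arrays by name.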
Answer 2:
Shelve is probably not a good choice, however...
You might try using klepto or joblib. Both are good at caching results, and can use efficient storage formats.
Both joblib and klepto can save your results to a file on disk, or to a directory. Both can also leverage the numpy internal storage format and/or compression on save… and also save to memory mapped files, if you like.
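For instance, joblib's dump/load helpers store numpy arrays efficiently and support compression and memory mapping; a minimal sketch (the file name and array here are only illustrative):

import numpy as np
import joblib

data = np.random.rand(10000, 3)                          # stand-in for accumulated results
joblib.dump(data, 'results.joblib')                      # numpy-aware pickle on disk
restored = joblib.load('results.joblib', mmap_mode='r')  # memory-map the arrays on load
# joblib.dump(data, 'results.joblib.gz', compress=3)     # or compressed (not memory-mappable)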
If you use klepto, it takes the dictionary key as the filename, and saves the value as the contents. With klepto, you can also pick whether you want to use pickle or json or some other storage format.
Python 2.7.7 (default, Jun 2 2014, 01:33:50)
[GCC 4.2.1 Compatible Apple Clang 4.1 ((tags/Apple/clang-421.11.66))] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import klepto
>>> data_dict = klepto.archives.dir_archive('storage', cached=False, serialized=True)
>>> import string
>>> import random
>>> for j in string.ascii_letters:
...   for k in range(1000):
...     data_dict.setdefault(j, []).append([int(10*random.random()) for i in range(3)])
...
>>>
This will create a directory called storage that contains pickled files, one for each key of your data_dict. There are keywords for using memmap files, and also for compression level. If you choose cached=True instead, then rather than dumping to file each time you write to data_dict, you write to memory each time… and you can then use data_dict.dump() to dump to disk whenever you choose… or you can pick a memory limit so that when you hit it, you dump to disk. Additionally, you can pick a caching strategy (like lru or lfu) for deciding which keys to purge from memory and dump to disk.
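A minimal sketch of that cached workflow, assuming the same dir_archive but with cached=True (the keys below are made up):

import klepto

# cached=True keeps new entries in memory; dump() pushes them to the 'storage' directory
archive = klepto.archives.dir_archive('storage', cached=True, serialized=True)
archive['run_0'] = [0.1, 0.2, 0.3]   # lives in memory only for now
archive['run_1'] = [0.4, 0.5, 0.6]
archive.dump()                       # write the cached entries to disk
archive.load()                       # pull keys from disk back into the in-memory cache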
Get klepto here: https://github.com/uqfoundation
or get joblib here: https://github.com/joblib/joblib
If you refactor, you could probably come up with a way to do this so it could take advantage of a pre-allocated array. However, it might depend on the profile of how your code runs.
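If the number of experiments per parameter tuple is known up front, a pre-allocated array might look like the following sketch, assuming each experiment returns a single float (parameters, number_of_experiments, and experiment are again the question's placeholders):

import numpy as np

# pre-allocate one row per parameter tuple, one column per experiment
results = np.empty((len(parameters), number_of_experiments))
for i, j in enumerate(parameters):
    for k in range(number_of_experiments):
        results[i, k] = experiment()
    np.save('storage.npy', results)   # checkpoint after each parameter tuple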
Does opening and closing files affect run time? Yes. If you use klepto, you can set the granularity of when you want to dump to disk. Then you can pick a trade-off of speed versus intermediate storage of results.
Source: https://stackoverflow.com/questions/24457060/frequently-updating-stored-data-for-a-numerical-experiment-using-python