Question
I am running a numerical experiment that requires many iterations. After each iteration, I would like to store the data in a pickle file (or a pickle-like file) in case the program times out or a data structure is tapped out. What is the best way to proceed? Here is the skeleton code:
import pickle

data_dict = {}  # maybe a dictionary is not the best choice
for j in parameters:  # j = (alpha, beta, gamma); cycle through parameter sets
    for k in range(number_of_experiments):  # lots of experiments (10^4)
        file = open('storage.pkl', 'ab')
        data = experiment()  # experiment returns some numerical value;
                             # it takes ~1 second, but this increases
                             # as the parameters scale
        data_dict.setdefault(j, []).append(data)
        pickle.dump(data_dict, file)
        file.close()
Questions:
- Is shelve a better choice here? Or is there some other Python library that I am not aware of?
- I am using a dict because it's easier to code and more flexible if I need to change things as I do more experiments. Would it be a huge advantage to use a pre-allocated array?
- Does opening and closing files affect run time? I do this so that I can check on the progress in addition to the text logs I have set up.
Thank you for all your help!
Answer 1:
- Assuming you are using numpy for your numerical experiments, I would suggest using numpy.savez instead of pickle.
- Keep it simple, and make optimizations only if you feel that the script runs too long.
- Opening and closing files does affect the run time, but having a backup is better anyway.

And I would use collections.defaultdict(list) instead of a plain dict with setdefault.
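A minimal sketch combining both suggestions, reusing the question's placeholder names (parameters, number_of_experiments, experiment) and an illustrative per-parameter-set filename:

import collections
import numpy as np

results = collections.defaultdict(list)  # no setdefault needed
for j in parameters:  # j = (alpha, beta, gamma)
    for k in range(number_of_experiments):
        results[j].append(experiment())
    # checkpoint after each parameter set; recover later with
    # np.load('checkpoint_<alpha>_<beta>_<gamma>.npz')['data']
    np.savez('checkpoint_%s_%s_%s.npz' % j, data=np.asarray(results[j]))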
Answer 2:
Shelve is probably not a good choice, however...
You might try using klepto or joblib. Both are good at caching results, and can use efficient storage formats.

Both joblib and klepto can save your results to a file on disk, or to a directory. Both can also leverage the numpy internal storage format and/or compression on save… and also save to memory-mapped files, if you like.
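With joblib, a comparable checkpointing sketch might use joblib.dump and joblib.load (the filename and compression level below are arbitrary illustrations, not anything this answer prescribes):

import joblib

results = {}
for j in parameters:  # same placeholder loop as in the question
    results[j] = [experiment() for k in range(number_of_experiments)]
    # joblib.dump stores numpy arrays efficiently and supports compression;
    # reload with joblib.load('results.joblib')
    joblib.dump(results, 'results.joblib', compress=3)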
If you use klepto, it takes the dictionary key as the filename, and saves the value as the contents. With klepto, you can also pick whether you want to use pickle or json or some other storage format.
Python 2.7.7 (default, Jun 2 2014, 01:33:50)
[GCC 4.2.1 Compatible Apple Clang 4.1 ((tags/Apple/clang-421.11.66))] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import klepto
>>> data_dict = klepto.archives.dir_archive('storage', cached=False, serialized=True)
>>> import string
>>> import random
>>> for j in string.ascii_letters:
...   for k in range(1000):
...     data_dict.setdefault(j, []).append([int(10*random.random()) for i in range(3)])
...
>>>
This will create a directory called storage that contains pickled files, one for each key of your data_dict. There are keywords for using memmap files, and also for the compression level. If you choose cached=True, then instead of dumping to file each time you write to data_dict, you'd write to memory each time… and you could then use data_dict.dump() to dump to disk whenever you choose… or you could pick a memory limit such that when you hit it, you dump to disk. Additionally, you can also pick a caching strategy (like lru or lfu) for deciding which keys to purge from memory and dump to disk.
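For example, a sketch of the cached variant with an explicit flush (same toy data as above; flushing once per outer key is an arbitrary choice):

>>> import klepto
>>> import string
>>> import random
>>> data_dict = klepto.archives.dir_archive('storage', cached=True, serialized=True)
>>> for j in string.ascii_letters:
...   for k in range(1000):
...     data_dict.setdefault(j, []).append([int(10*random.random()) for i in range(3)])
...   data_dict.dump()  # flush the in-memory cache to the 'storage' directory
...
>>>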
Get klepto here: https://github.com/uqfoundation, or get joblib here: https://github.com/joblib/joblib.
If you refactor, you could probably come up with a way to do this so it could take advantage of a pre-allocated array. However, it might depend on the profile of how your code runs.
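One possible shape for such a refactor, assuming each experiment() returns a single float and the sizes are known up front (the array layout and filename here are illustrative, not from the original code):

import numpy as np

# rows = parameter sets, columns = repeated experiments
results = np.empty((len(parameters), number_of_experiments))
for row, j in enumerate(parameters):
    for col in range(number_of_experiments):
        results[row, col] = experiment()
    np.save('results.npy', results)  # checkpoint the whole array at once

Writing into a fixed-size array avoids regrowing lists and repickling the whole dict on every iteration, at the cost of the flexibility the question mentions.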
Does opening and closing files affect run time? Yes. If you use klepto, you can set the granularity of when you want to dump to disk, and then pick your trade-off of speed versus intermediate storage of results.
Source: https://stackoverflow.com/questions/24457060/frequently-updating-stored-data-for-a-numerical-experiment-using-python