Question
We have trained an Extra Trees model for a regression task. Our model consists of 3 Extra Trees ensembles, each with 200 trees of depth 30. On top of the 3 ensembles we use a ridge regression. We train the model for several hours and pickle the trained model (the entire class object) for later use. However, the saved model is far too big, about 140 GB! Is there a way to reduce the size of the saved model? Are there any pickle settings that could help, or any alternative to pickle?
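For reference, a minimal sketch of the kind of setup described above (class names and hyperparameters are assumptions based on scikit-learn, not the asker's actual code); pickling the whole object graph stores every node of every tree, which is where the file size comes from:

import pickle
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.linear_model import Ridge

# Three Extra Trees ensembles, each with 200 trees limited to depth 30.
forests = [ExtraTreesRegressor(n_estimators=200, max_depth=30) for _ in range(3)]
# A ridge regression is fitted on top of the forests' predictions (fitting omitted here).
ridge = Ridge()

# Pickling the entire object serializes all tree nodes of all estimators.
with open('model.pkl', 'wb') as f:
    pickle.dump({'forests': forests, 'ridge': ridge}, f)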
Answer 1:
In the worst case (fully grown binary trees of depth 30), you will have 3 * 200 * (2^30 - 1) = 644,245,093,800 nodes, or roughly 600 GiB even if each node cost only 1 byte to store. Compared with that bound, 140 GB is a pretty reasonable size.
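A quick back-of-the-envelope check of that bound (a minimal sketch; the 1-byte-per-node cost is the same simplifying assumption as above, real tree nodes take considerably more):

n_forests, n_trees, depth = 3, 200, 30
# A binary tree of depth 30 has at most 2**30 - 1 nodes.
max_nodes = n_forests * n_trees * (2 ** depth - 1)
print(max_nodes)              # 644245093800 nodes
print(max_nodes / 1024 ** 3)  # ~600 GiB at 1 byte per node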
Answer 2:
You can try using joblib with its compress parameter:
import joblib  # in older scikit-learn this was sklearn.externals.joblib, which has since been removed

joblib.dump(your_algo, 'pickle_file_name.pkl', compress=3)
compress: an integer from 0 to 9. A higher value means more compression, but also slower read and write times; a value of 3 is often a good compromise.
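As a rough illustration of the trade-off (your_algo and the filenames are placeholders from the example above), you can compare the size on disk with and without compression:

import os
import joblib

joblib.dump(your_algo, 'model_raw.pkl')                     # no compression
joblib.dump(your_algo, 'model_compressed.pkl', compress=3)  # zlib level 3

for path in ('model_raw.pkl', 'model_compressed.pkl'):
    print(path, os.path.getsize(path), 'bytes')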
joblib also supports the standard compression formats zlib, gzip, bz2, lzma and xz; to use one of them, just give the file name the matching extension, for example:
joblib.dump(obj, 'your_filename.pkl.z') # zlib
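Loading works the same way regardless of which compression format was used, since joblib detects it from the file itself (the filename below is the placeholder from the example above):

import joblib
obj = joblib.load('your_filename.pkl.z')  # decompression is handled transparently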
For more information, see: http://gael-varoquaux.info/programming/new_low-overhead_persistence_in_joblib_for_big_data.html
Source: https://stackoverflow.com/questions/43591621/trained-machine-learning-model-is-too-big