Question
I am using the code below to save a random forest model, with cPickle to persist the trained model. As new data comes in, can I train the model incrementally? Currently, the training set has about 2 years of data. Is there a way to train on another 2 years and (in effect) append it to the existing saved model?
import os
import cPickle  # Python 2; on Python 3 use the pickle module instead

from sklearn.ensemble import RandomForestRegressor

# x_train, y_train, x_test and col_feature are defined earlier in my script.
rf = RandomForestRegressor(n_estimators=100)
print ("Trying to fit the Random Forest model --> ")
if os.path.exists('rf.pkl'):
    print ("Trained model already pickled -- >")
    with open('rf.pkl', 'rb') as f:
        rf = cPickle.load(f)
else:
    df_x_train = x_train[col_feature]
    rf.fit(df_x_train, y_train)
    print ("Training for the model done ")
    with open('rf.pkl', 'wb') as f:
        cPickle.dump(rf, f)
df_x_test = x_test[col_feature]
pred = rf.predict(df_x_test)
EDIT 1: I don't have the compute capacity to train the model on 4 years of data all at once.
Answer 1:
What you're talking about, updating a model with additional data incrementally, is discussed in the sklearn User Guide:
Although not all algorithms can learn incrementally (i.e. without seeing all the instances at once), all estimators implementing the partial_fit API are candidates. Actually, the ability to learn incrementally from a mini-batch of instances (sometimes called “online learning”) is key to out-of-core learning as it guarantees that at any given time there will be only a small amount of instances in the main memory.
They include a list of classifiers and regressors implementing partial_fit(), but RandomForest is not among them. You can also confirm that RandomForestRegressor does not implement partial_fit() on its documentation page.
Some possible ways forward:
- Use a regressor which does implement partial_fit(), such as SGDRegressor (see the first sketch after this list).
- Check your RandomForest model's feature_importances_ attribute, then retrain your model on 3 or 4 years of data after dropping unimportant features (see the second sketch after this list).
- Train your model on only the most recent two years of data, if you can only use two years.
- Train your model on a random subset drawn from all four years of data.
- Lower the max_depth parameter to constrain how complicated your model can get. This saves computation time and so may allow you to use all your data. It can also prevent overfitting. Use cross-validation to select the best tree-depth hyperparameter for your problem.
- Set your RF model's param n_jobs=-1 if you haven't already, to use multiple cores/processors on your machine.
- Use a faster ensemble-tree-based algorithm, such as xgboost.
- Run your model-fitting code on a large machine in the cloud, such as AWS or dominodatalab.
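A minimal sketch of the partial_fit route, assuming load_chunks() is a hypothetical generator that yields mini-batches of the col_feature columns and their targets (SGD is sensitive to feature scale, so the scaler is fit incrementally too):

from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

# First pass over the data: fit the scaler incrementally.
scaler = StandardScaler()
for chunk_x, chunk_y in load_chunks():  # hypothetical mini-batch generator
    scaler.partial_fit(chunk_x)

# Second pass: fit the regressor one mini-batch at a time.
sgd = SGDRegressor()
for chunk_x, chunk_y in load_chunks():
    sgd.partial_fit(scaler.transform(chunk_x), chunk_y)

pred = sgd.predict(scaler.transform(x_test[col_feature]))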
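And a sketch of the feature-importance pruning idea, assuming rf is the forest already trained on two years; the 0.01 threshold and the four-year frames x_train_4yr / y_train_4yr are made-up placeholders:

from sklearn.ensemble import RandomForestRegressor

# Keep only features whose importance clears a problem-specific threshold.
keep = [col for col, imp in zip(col_feature, rf.feature_importances_)
        if imp > 0.01]

rf_small = RandomForestRegressor(n_estimators=100, n_jobs=-1)
rf_small.fit(x_train_4yr[keep], y_train_4yr)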
Answer 2:
You can set the warm_start parameter to True on the model. This keeps the trees learnt by the previous fit call instead of discarding them. Note that new trees are only grown if you also increase n_estimators between calls; otherwise sklearn warns that no new trees are fit.
The same model fit twice in a row (on train_X[:1], then train_X[1:2]) after setting warm_start=True:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

forest_model = RandomForestRegressor(warm_start=True)
forest_model.fit(train_X[:1], train_y[:1])
pred_y = forest_model.predict(val_X[:1])
mae = mean_absolute_error(pred_y, val_y[:1])
print("mae :", mae)
print('pred_y :', pred_y)

# Second fit call: the trees from the first call are retained.
forest_model.fit(train_X[1:2], train_y[1:2])
pred_y = forest_model.predict(val_X[1:2])
mae = mean_absolute_error(pred_y, val_y[1:2])
print("mae :", mae)
print('pred_y :', pred_y)
mae : 1290000.0
pred_y : [ 1630000.]
mae : 925000.0
pred_y : [ 1630000.]
For comparison, a fresh model trained only on the last slice (train_X[1:2]):
forest_model = RandomForestRegressor()
forest_model.fit(train_X[1:2], train_y[1:2])
pred_y = forest_model.predict(val_X[1:2])
mae = mean_absolute_error(pred_y, val_y[1:2])
print("mae :", mae)
print('pred_y :', pred_y)
mae : 515000.0
pred_y : [ 1220000.]
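Notice that pred_y is identical before and after the second fit call in the warm_start demonstration: because n_estimators was never increased, no new trees were actually grown. A minimal sketch of warm_start used for genuine incremental growth, with hypothetical names for the two 2-year data slices:

from sklearn.ensemble import RandomForestRegressor

# Grow 100 trees on the first two years of data.
rf = RandomForestRegressor(n_estimators=100, warm_start=True)
rf.fit(x_train_first2yr, y_train_first2yr)  # hypothetical first chunk

# Grow 100 additional trees on the next two years; the first 100 are kept.
rf.n_estimators += 100
rf.fit(x_train_next2yr, y_train_next2yr)    # hypothetical second chunk

Keep in mind that each tree only ever sees the chunk it was grown on, so this is not equivalent to retraining the whole forest on all four years of data.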
http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html
Source: https://stackoverflow.com/questions/44060432/incremental-training-of-random-forest-model-using-python-sklearn