How to use warm_start

栀梦 · 2020-12-25 15:51

I'd like to use the warm_start parameter to add training data to my random forest classifier. I expected it to be used like this:

clf = RandomForestClassifier(warm_start=True)
clf.fit(get_data())
clf.fit(get_more_data())

4 Answers
  • 2020-12-25 15:55

    The basic pattern of (taken from Miriam's answer):

    clf = RandomForestClassifier(warm_start=True)
    clf.fit(get_data())
    clf.fit(get_more_data())
    

    would be the correct usage API-wise.

    But there is an issue here.

    The docs say the following:

    When set to True, reuse the solution of the previous call to fit and add more estimators to the ensemble, otherwise, just fit a whole new forest.

    This means that the only thing warm_start can do for you is add new DecisionTrees. All the previous trees are left untouched!

    Let's check this in the source code:

    n_more_estimators = self.n_estimators - len(self.estimators_)

    if n_more_estimators < 0:
        raise ValueError('n_estimators=%d must be larger or equal to '
                         'len(estimators_)=%d when warm_start==True'
                         % (self.n_estimators, len(self.estimators_)))

    elif n_more_estimators == 0:
        warn("Warm-start fitting without increasing n_estimators does not "
             "fit new trees.")
    

    This basically tells us that you need to increase the number of estimators before calling fit again!
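
    You can trigger exactly that warning yourself (a minimal sketch of mine, on the iris data):

    import warnings
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier

    X, y = load_iris(return_X_y=True)

    clf = RandomForestClassifier(n_estimators=10, warm_start=True)
    clf.fit(X, y)

    # Fitting again without raising n_estimators adds no trees and warns
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        clf.fit(X, y)
    print(caught[0].message)     # "Warm-start fitting without increasing ..."
    print(len(clf.estimators_))  # still 10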

    I have no idea what kind of usage sklearn expects here. I'm not sure if fitting, increasing internal variables, and fitting again is correct usage, but I somehow doubt it (especially as n_estimators is not a public class-variable).

    Your basic approach (in regard to this library and this classifier) is probably not a good fit for out-of-core learning here! I would not pursue this further.

  • 2020-12-25 15:55

    Just to add to @sascha's excellent answer, this hacky method works:

    rf = RandomForestClassifier(n_estimators=1, warm_start=True)
    rf.fit(X_train, y_train)
    rf.n_estimators += 1          # request one more tree
    rf.fit(X_train, y_train)      # the existing tree is kept; one tree is added
    
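    Generalised to a stream of batches, the same trick becomes a crude incremental-learning loop (a sketch of mine, using shuffled iris data as a stand-in for real batches; one new tree per batch is an arbitrary choice):

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.utils import shuffle

    # Shuffle so each batch is likely to contain every class
    X, y = shuffle(*load_iris(return_X_y=True), random_state=0)

    rf = RandomForestClassifier(n_estimators=1, warm_start=True)
    for i, (X_batch, y_batch) in enumerate(zip(np.array_split(X, 5),
                                               np.array_split(y, 5))):
        if i > 0:
            rf.n_estimators += 1  # grow the forest by one tree per batch
        rf.fit(X_batch, y_batch)  # only the new tree sees this batch
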
  • 2020-12-25 15:56
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.utils import shuffle

    # Shuffle so that every warm_start batch contains samples from each class
    iris = load_iris()
    X, y = shuffle(iris.data, iris.target, random_state=0)

    ### RandomForestClassifier
    rfc = RandomForestClassifier(n_estimators=10, warm_start=True)
    rfc.fit(X[:50], y[:50])
    print(rfc.score(X, y))
    rfc.n_estimators += 10
    rfc.fit(X[50:100], y[50:100])
    print(rfc.score(X, y))
    rfc.n_estimators += 10
    rfc.fit(X[100:150], y[100:150])
    print(rfc.score(X, y))
    

    Below is the distinction between warm_start and partial_fit, quoting the scikit-learn glossary:

    When fitting an estimator repeatedly on the same dataset, but for multiple parameter values (such as to find the value maximizing performance as in grid search), it may be possible to reuse aspects of the model learnt from the previous parameter value, saving time. When warm_start is true, the existing fitted model attributes are used to initialise the new model in a subsequent call to fit. Note that this is only applicable for some models and some parameters, and even some orders of parameter values. For example, warm_start may be used when building random forests to add more trees to the forest (increasing n_estimators) but not to reduce their number.

    partial_fit also retains the model between calls, but differs: with warm_start the parameters change and the data is (more-or-less) constant across calls to fit; with partial_fit, the mini-batch of data changes and model parameters stay fixed.
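
    To make the contrast concrete, here is a minimal sketch of mine using SGDClassifier, one of the estimators that actually implements partial_fit:

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.linear_model import SGDClassifier
    from sklearn.utils import shuffle

    X, y = shuffle(*load_iris(return_X_y=True), random_state=0)

    # partial_fit updates the same coefficients in place on each mini-batch;
    # all classes must be declared on the first call.
    sgd = SGDClassifier()
    sgd.partial_fit(X[:75], y[:75], classes=np.unique(y))
    sgd.partial_fit(X[75:], y[75:])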

    There are cases where you want to use warm_start to fit on different, but closely related data. For example, one may initially fit to a subset of the data, then fine-tune the parameter search on the full dataset. For classification, all data in a sequence of warm_start calls to fit must include samples from each class.

  • 2020-12-25 15:57

    What warm_start does boils down to preserving the state of the previous training run.


    It differs from partial_fit in that the idea is not to learn incrementally on small batches of data, but rather to reuse a trained model in its previous state. Namely, the difference between a regular call to fit and a fit with warm_start=True is that the estimator's state is not cleared; see _clear_state:

    if not self.warm_start:
        self._clear_state()
    

    This would, among other attributes, reinitialize all estimators:

    if hasattr(self, 'estimators_'):
        self.estimators_ = np.empty((0, 0), dtype=np.object)
    

    So with warm_start=True, each subsequent call to fit will not reinitialize the trainable parameters; instead it will start from their previous state and add new estimators to the model.
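
    One way to convince yourself of this (a quick sketch of mine) is to check that the trees from the first fit are the very same objects after a warm second fit:

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier

    X, y = load_iris(return_X_y=True)

    forest = RandomForestClassifier(n_estimators=10, warm_start=True)
    forest.fit(X, y)
    first_trees = list(forest.estimators_)

    forest.n_estimators += 5
    forest.fit(X, y)

    # The original ten trees were kept untouched; five were appended
    assert forest.estimators_[:10] == first_trees
    assert len(forest.estimators_) == 15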


    This means that one could first run a randomized search over the other hyperparameters:

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import RandomizedSearchCV

    grid1 = {'bootstrap': [True, False],
             'max_depth': [10, 20, 30, 40, 50, 60],
             'max_features': ['auto', 'sqrt'],
             'min_samples_leaf': [1, 2, 4],
             'min_samples_split': [2, 5, 10]}

    rf_grid_search1 = RandomizedSearchCV(estimator=RandomForestClassifier(),
                                         param_distributions=grid1,
                                         cv=3,
                                         random_state=12)
    rf_grid_search1.fit(X_train, y_train)
    

    Then fit a model on the best parameters and set warm_start=True:

    rf = RandomForestClassifier(**rf_grid_search1.best_params_, warm_start=True)
    rf.fit(X_train, y_train)
    

    Then we could run a second search over n_estimators only:

    grid2 = {'n_estimators': [200, 400, 600, 800, 1000]}

    rf_grid_search2 = RandomizedSearchCV(estimator=rf,
                                         param_distributions=grid2,
                                         cv=3,
                                         random_state=12,
                                         n_iter=4)
    rf_grid_search2.fit(X_train, y_train)
    

    The advantage here is that the estimators are already fit with the previous parameter setting, and with each subsequent call to fit the model starts from the previous state; we're just analysing whether adding new estimators would benefit the model.
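
    To check whether the extra trees actually paid off, one could then inspect the search results afterwards (a sketch; assumes pandas is installed):

    import pandas as pd

    results = pd.DataFrame(rf_grid_search2.cv_results_)
    print(results[['param_n_estimators', 'mean_test_score']])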
