Combining random forest models in scikit-learn

Asked by 面向向阳花 on 2020-12-08 11:21

I have two RandomForestClassifier models, and I would like to combine them into one meta model. They were both trained using similar, but different, data. How can I do this?

2 Answers
  • 2020-12-08 11:45

    I believe this is possible by modifying the estimators_ and n_estimators attributes on the RandomForestClassifier object. Each tree in the forest is stored as a DecisionTreeClassifier object, and the list of these trees is stored in the estimators_ attribute. To keep the model consistent, you also need to update n_estimators to match the new length of that list.

    The advantage of this method is that you could build several small forests in parallel across multiple machines and then combine them into one model (see the parallel sketch after the example below).

    Here's an example using the iris data set:

    from functools import reduce

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.datasets import load_iris

    def generate_rf(X_train, y_train, X_test, y_test):
        rf = RandomForestClassifier(n_estimators=5, min_samples_leaf=3)
        rf.fit(X_train, y_train)
        print("rf score", rf.score(X_test, y_test))
        return rf

    def combine_rfs(rf_a, rf_b):
        # Append rf_b's trees to rf_a and keep n_estimators in sync
        rf_a.estimators_ += rf_b.estimators_
        rf_a.n_estimators = len(rf_a.estimators_)
        return rf_a

    iris = load_iris()
    X, y = iris.data[:, [0, 1, 2]], iris.target
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
    # In the line below, we create 10 random forest classifier models
    rfs = [generate_rf(X_train, y_train, X_test, y_test) for i in range(10)]
    # In this step, we fold the list of forests into one model holding all 50 trees
    rf_combined = reduce(combine_rfs, rfs)
    # The combined model scores better than *most* of the component models
    print("rf combined score", rf_combined.score(X_test, y_test))
    
  • 2020-12-08 11:48

    In addition to @mgoldwasser's solution, an alternative is to make use of warm_start when training your forest. In scikit-learn 0.16-dev, you can now do the following:

    # First build 100 trees on X1, y1
    clf = RandomForestClassifier(n_estimators=100, warm_start=True)
    clf.fit(X1, y1)
    
    # Build 100 additional trees on X2, y2
    clf.set_params(n_estimators=200)
    clf.fit(X2, y2)
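
    Note that each fit call with warm_start=True keeps the trees already in the forest and only grows it by the difference in n_estimators, so the first 100 trees here see only X1, y1 and the next 100 only X2, y2. Below is a self-contained sketch of the same pattern on a current scikit-learn, using an iris split to stand in for the two datasets (the split is purely illustrative):

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)
    # Halves of iris stand in for the two similar-but-different datasets
    X1, X2, y1, y2 = train_test_split(X, y, test_size=0.5, random_state=0)

    clf = RandomForestClassifier(n_estimators=100, warm_start=True)
    clf.fit(X1, y1)                    # first 100 trees, trained on X1, y1
    clf.set_params(n_estimators=200)
    clf.fit(X2, y2)                    # adds 100 trees trained on X2, y2
    print(len(clf.estimators_))        # 200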
    