Suggestions for speeding up Random Forests

执念已碎 · 2020-12-01 00:46

I'm doing some work with the randomForest package and, while it works well, it can be time-consuming. Anyone have any suggestions for speeding things up?

4 Answers
  • 2020-12-01 00:47

    Why don't you use an already parallelized and optimized implementation of random forest? Have a look at SPRINT, which uses MPI: http://www.r-sprint.org/
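
    A hedged sketch of what an SPRINT-based script might look like; the prandomForest entry point and the pterminate() shutdown call are assumptions drawn from SPRINT's documented naming style, so verify both against the package documentation before use:

    # launch under MPI, e.g.: mpiexec -n 4 R -f sprint_rf.R
    library("sprint")
    rf <- prandomForest(x = X, y = y, ntree = 1000)  # assumed SPRINT parallel RF call
    pterminate()  # assumed call to shut down the MPI workers
    quit()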

  • 2020-12-01 00:49

    Is there any particular reason why you're not using Python (namely the scikit-learn and multiprocessing modules) to implement this? Using joblib, I've trained random forests on datasets of similar size in a fraction of the time it takes in R. Even without multiprocessing, random forests are significantly faster in Python. Here's a quick example of training an RF classifier and cross-validating it in Python. You can also easily extract feature importances and visualize the trees.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                                 matthews_corrcoef, precision_score, recall_score)
    from sklearn.model_selection import StratifiedKFold

    # assuming that you have read in data (a numpy array) with headers;
    # the first column corresponds to the response variable
    y = data[1:, 0].astype(float)
    X = data[1:, 1:].astype(float)

    cm = np.zeros((2, 2), dtype=int)
    precision, accuracy, sensitivity, f1, matthews = [], [], [], [], []

    # n_jobs=2 grows trees on two cores; use n_jobs=-1 for all available cores
    rf = RandomForestClassifier(n_estimators=100, max_features=5, n_jobs=2)

    # divide the dataset into 5 "folds", with classes equally balanced in each fold
    cv = StratifiedKFold(n_splits=5)
    for train, test in cv.split(X, y):
        classes = rf.fit(X[train], y[train]).predict(X[test])
        precision.append(precision_score(y[test], classes))
        accuracy.append(accuracy_score(y[test], classes))
        sensitivity.append(recall_score(y[test], classes))
        f1.append(f1_score(y[test], classes))
        matthews.append(matthews_corrcoef(y[test], classes))
        cm += confusion_matrix(y[test], classes)

    print("Accuracy: %0.2f (+/- %0.2f)" % (np.mean(accuracy), np.std(accuracy) * 2))
    print("Precision: %0.2f (+/- %0.2f)" % (np.mean(precision), np.std(precision) * 2))
    print("Sensitivity: %0.2f (+/- %0.2f)" % (np.mean(sensitivity), np.std(sensitivity) * 2))
    print("F1: %0.2f (+/- %0.2f)" % (np.mean(f1), np.std(f1) * 2))
    print("Matthews: %0.2f (+/- %0.2f)" % (np.mean(matthews), np.std(matthews) * 2))
    print(cm)
    
  • 2020-12-01 01:01

    There are two 'out of the box' options that address this problem. First, the caret package provides a 'parRF' method that handles this elegantly; I commonly use it with 16 cores to great effect (see the sketch below). The randomShrubbery package also takes advantage of multiple cores for RF on Revolution R.
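
    A minimal sketch of the parRF route, assuming a doParallel backend (parRF grows its trees through foreach, so a parallel backend must be registered first); X and y stand in for your predictors and response:

    library("caret")
    library("doParallel")

    cl <- makeCluster(4)    # start 4 workers; scale up (e.g. to 16) as your hardware allows
    registerDoParallel(cl)  # parRF dispatches tree growing through this backend
    fit <- train(x = X, y = y, method = "parRF", ntree = 1000)
    stopCluster(cl)         # release the workers when done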

  • 2020-12-01 01:11

    The manual of the foreach package has a section on Parallel Random Forests (Using The foreach Package, Section 5.1):

    > library("foreach")
    > library("doSNOW")
    > registerDoSNOW(makeCluster(4, type="SOCK"))
    
    > x <- matrix(runif(500), 100)
    > y <- gl(2, 50)
    
    > rf <- foreach(ntree = rep(250, 4), .combine = combine, .packages = "randomForest") %dopar%
    +    randomForest(x, y, ntree = ntree)
    > rf
    Call:
    randomForest(x = x, y = y, ntree = ntree)
    Type of random forest: classification
    Number of trees: 1000
    

    If we want to create a random forest model with 1000 trees and our computer has four cores, we can split the problem into four pieces by executing the randomForest function four times, with the ntree argument set to 250. Of course, we have to combine the resulting randomForest objects, but the randomForest package comes with a function called combine, which the .combine argument above uses. A variant using the doParallel backend is sketched below.
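
    For comparison, a sketch of the same split using the doParallel backend instead of doSNOW; .multicombine = TRUE lets foreach merge all four forests in a single combine() call (assumed setup; adjust the core count to your machine):

    > library("doParallel")
    > library("randomForest")
    > registerDoParallel(cores = 4)
    > rf <- foreach(ntree = rep(250, 4), .combine = combine,
    +               .multicombine = TRUE, .packages = "randomForest") %dopar%
    +     randomForest(x, y, ntree = ntree)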
