I've been trying to figure out how scikit-learn's Random Forest uses sample_weight, and I cannot explain some of the results I'm seeing. Fundamentally I need it to balance a classification problem with unbalanced classes. In particular, I was expecting that a sample_weight array of all 1's would give the same result as sample_weight=None. Additionally, I was expecting that any array of equal weights (i.e. all 1's, all 10's, or all 0.8's...) would give the same result. Perhaps my intuition about weights is wrong in this case. Here's the code:
import numpy as np
from sklearn import ensemble, metrics, cross_validation, datasets

# create a synthetic dataset with unbalanced classes
X, y = datasets.make_classification(
    n_samples=10000, n_features=20, n_informative=4, n_redundant=2,
    n_repeated=0, n_classes=2, n_clusters_per_class=2, weights=[0.9],
    flip_y=0.01, class_sep=1.0, hypercube=True, shift=0.0, scale=1.0,
    shuffle=True, random_state=0)

model = ensemble.RandomForestClassifier()
w0 = 1  # weight associated with 0's
w1 = 1  # weight associated with 1's
# I should split train and validation but, for the sake of understanding sample_weight, I'll skip this step
model.fit(X, y, sample_weight=np.array([w0 if r == 0 else w1 for r in y]))

preds = model.predict(X)
probas = model.predict_proba(X)

ACC = metrics.accuracy_score(y, preds)
precision, recall, thresholds = metrics.precision_recall_curve(y, probas[:, 1])
fpr, tpr, thresholds = metrics.roc_curve(y, probas[:, 1])
ROC = metrics.auc(fpr, tpr)
cm = metrics.confusion_matrix(y, preds)

print "ACCURACY:", ACC
print "ROC:", ROC
print "F1 Score:", metrics.f1_score(y, preds)
print "TP:", cm[1, 1], cm[1, 1] / (cm.sum() + 0.0)
print "FP:", cm[0, 1], cm[0, 1] / (cm.sum() + 0.0)
print "Precision:", cm[1, 1] / (cm[1, 1] + cm[0, 1] * 1.0)  # *1.0 forces float division
print "Recall:", cm[1, 1] / (cm[1, 1] + cm[1, 0] * 1.0)
With w0=w1=1 I get, for instance, F1=0.9456. With w0=w1=10 I get, for instance, F1=0.9569. With sample_weight=None I get F1=0.9474.
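
Could the differences just be the forest's own bootstrapping randomness rather than the weights? One check I can think of (a sketch, not folded into the results above) is to pin the model's random_state so that two fits on identical inputs are directly comparable:

    # pin the forest's RNG so repeated fits on the same inputs are reproducible
    model = ensemble.RandomForestClassifier(random_state=0)

    # fit once with uniform weights and once with None; if all-1 weights truly
    # behave like no weighting, the predictions should now match exactly
    model.fit(X, y, sample_weight=np.ones(len(y)))
    preds_ones = model.predict(X)

    model.fit(X, y, sample_weight=None)
    preds_none = model.predict(X)

    print "identical predictions:", np.array_equal(preds_ones, preds_none)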
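
And for completeness, the balancing I'm ultimately after would weight each sample by the inverse frequency of its class, something like this (a sketch; balanced_weights, w_pos and w_neg are just names I'm using here):

    # weight each sample inversely to its class frequency, so that both
    # classes contribute equally to the weighted impurity at each split
    n = float(len(y))
    w_neg = n / (2.0 * (y == 0).sum())  # large weight for the minority... or majority, depending on the split
    w_pos = n / (2.0 * (y == 1).sum())
    balanced_weights = np.where(y == 1, w_pos, w_neg)
    model.fit(X, y, sample_weight=balanced_weights)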
Thanks,
G