Question
I am trying to implement the bag-of-words model from the Kaggle site with Twitter sentiment data that has around 1M rows. I have already cleaned it, but in the last part, when I fit my feature vectors and sentiments to a Random Forest classifier, it takes a very long time. Here is my code:
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(n_estimators=100, verbose=3)
forest = forest.fit(train_data_features, train["Sentiment"])
train_data_features is a 1048575x5000 sparse matrix. I tried converting it to a dense array, but that raised a memory error.
What am I doing wrong? Can someone suggest a resource or another way to make this faster? I am an absolute novice in machine learning and don't have much programming background, so any guidance would help.
Thank you in advance.
Answer 1:
Actually the solution is pretty straightforward: get a strong machine and run it in parallel. By default RandomForestClassifier uses a single thread, but since it is an ensemble of completely independent models you can train each of these 100 trees in parallel. Just set
forest = RandomForestClassifier(n_estimators=100, verbose=3, n_jobs=-1)
to use all of your cores. You can also limit max_depth, which will speed things up (and you will probably need it anyway, since a random forest can overfit badly without any limit on tree depth).
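Putting the two suggestions together, here is a minimal sketch. The synthetic sparse matrix stands in for the poster's 1048575x5000 bag-of-words features (scaled down so it runs quickly), and max_depth=10 is an illustrative value, not a recommendation. Note that fit accepts a scipy CSR matrix directly, so there is no need to densify it (which is what caused the memory error):

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the real bag-of-words features:
# 2000 samples x 500 terms, 1% non-zero, in CSR format.
rng = np.random.RandomState(0)
X = sparse_random(2000, 500, density=0.01, format="csr", random_state=rng)
y = rng.randint(0, 2, size=2000)  # binary sentiment labels

# n_jobs=-1 trains the independent trees on all available cores;
# max_depth caps tree size, which speeds training and curbs overfitting.
forest = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    n_jobs=-1,
    random_state=0,
)
forest.fit(X, y)  # sparse input is fine; no .toarray() needed
print(forest.score(X, y))
```

On the full 1M-row dataset the same pattern applies: keep the features sparse, set n_jobs=-1, and tune max_depth (or n_estimators) down if training is still too slow.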
Source: https://stackoverflow.com/questions/43640546/how-to-make-randomforestclassifier-faster