Is there class weight (or alternative way) for GradientBoostingClassifier in Sklearn when dealing with VotingClassifier or Grid search?

限于喜欢 提交于 2020-12-30 06:39:09

问题


I'm using GradientBoostingClassifier for my unbalanced labeled datasets. It seems like class weight doesn't exist as a parameter for this classifier in Sklearn. I see I can use sample_weight when fit but I cannot use it when I deal with VotingClassifier or GridSearch. Could someone help?


回答1:


Currently there isn't a way to use class_weights for GB in sklearn.

Don't confuse this with sample_weight

Sample Weights change the loss function and your score that you're trying to optimize. This is often used in case of survey data where sampling approaches have gaps.

Class Weights are used to correct class imbalances as a proxy for over \ undersampling. There is no direct way to do that for GB in sklearn (you can do that in Random Forests though)




回答2:


Very late, but I hope it can be useful for other members.

In the article of Zichen Wang in towardsdatascience.com, the point 5 Gradient Boosting it is told:

For instance, Gradient Boosting Machines (GBM) deals with class imbalance by constructing successive training sets based on incorrectly classified examples. It usually outperforms Random Forest on imbalanced dataset For instance, Gradient Boosting Machines (GBM) deals with class imbalance by constructing successive training sets based on incorrectly classified examples. It usually outperforms Random Forest on imbalanced dataset.

And a chart shows that the half of the grandient boosting model have an AUROC over 80%. So considering GB models performances and the way they are done, it seems not to be necessary to introduce a kind of class_weight parameter as it is the case for RandomForestClassifier in sklearn package.

In the book Introduction To Machine Learning with Pyhton written by Andreas C. Müller and Sarah Guido, edition 2017, page 89, Chapter 2 *Supervised Learning, section Ensembles of Decision Trees, sub-section Gradient boosted regression trees (gradient boosting machines):

They are generally a bit more sensitive to parameter settings than random forests, but can provide better accuracy if the parameters are set correctly.

Now if you still have scoring problems due to imbalance proportions of categories in the target variable, it is possible you should see if your data should be splited to apply different models on it, because they are not as homogeneous as it seems to be. I mean it may have a variable you have not in your dataset train (an hidden variable clearly) that influences a lot the model results, then it is difficult even for the greater GB to give correct scoring because it misses a huge information that you cannot make appear in the matrix to compute sometimes for many reasons.

Some updates:

I found, by random, there are libraries that implement it as parameters of their gradient boosting instance objects. It is the case of H2O where for the parameter balance_classes it is told:

Balance training data class counts via over/under-sampling (for imbalanced data).

Type: bool (default: False).

If you want to keep with sklearn you should do as HakunaMaData told: over/under-sampling because that's what other libraries finally do when the parameter exist.




回答3:


Yes, there is sample_weight in fit method

http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html#sklearn.ensemble.GradientBoostingClassifier.fit

sample_weight : array-like, shape = [n_samples] or None

Sample weights. If None, then samples are equally weighted. Splits that would create child nodes with net zero or negative weight are ignored while searching for a split in each node. In the case of classification, splits are also ignored if they would result in any single class carrying a negative weight in either child node.

Simply pass per-sample weights depending on their class



来源:https://stackoverflow.com/questions/35539937/is-there-class-weight-or-alternative-way-for-gradientboostingclassifier-in-skl

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!