Oversampling: SMOTE for binary and categorical data in Python

允我心安 submitted on 2019-12-07 09:00:56

Question


I would like to apply SMOTE to an unbalanced dataset that contains binary, categorical and continuous data. Is there a way to apply SMOTE to binary and categorical data?


Answer 1:


As of January 2018, this had not been implemented in Python. The following is a reference from the team. In fact, they are open to proposals if someone wants to implement it.

For those with an academic interest in this ongoing issue, the paper from Chawla & Bowyer addresses this SMOTE-Non Continuous sampling problem in section 6.1.

Update: This feature was implemented as of 21 Oct 2018, and the feature request is now closed.




Answer 2:


As per the documentation, this is now possible with the use of SMOTENC. SMOTE-NC is capable of handling a mix of categorical and continuous features.

Here is the code from the documentation:

    from imblearn.over_sampling import SMOTENC
    smote_nc = SMOTENC(categorical_features=[0, 2], random_state=0)
    X_resampled, y_resampled = smote_nc.fit_resample(X, y)




Answer 3:


According to the documentation, SMOTE does not yet support categorical data in Python; it produces continuous outputs.

You can instead employ a workaround where you convert the categorical variables to integers and use SMOTE.

Then use np.round(X_train[categorical_variables]) to convert them back to the respective categorical values.



Source: https://stackoverflow.com/questions/47655813/oversampling-smote-for-binary-and-categorical-data-in-python
