Sampling before or after feature selection

Submitted by 泪湿孤枕 on 2021-02-08 06:32:48

Question


I am confused about the order of feature selection, sampling, and cross-validation. My dataset has 468 rows and 23000 columns; 269 rows belong to class I and 199 to class II. After the train/test split, the train set has 215 class I and 159 class II rows, and the test set has 54 class I and 40 class II rows. Because there are so few samples, I applied SMOTE oversampling to the train data to reduce bias. Or should I apply undersampling here instead, which loses data and leaves an even smaller sample?

The orderings I am considering:

I) Apply oversampling first, then feature selection, then cross-validation. In this case, cross-validation might be biased because of the repetition of rows introduced by oversampling.
II) Apply feature selection first, then oversampling, then cross-validation, which induces the same bias as above.
III) Apply feature selection first and then, inside a 10-fold cross-validation, perform sampling on the 9 training folds.
IV) Start with cross-validation and, inside each iteration, perform feature selection and then oversample the selected-feature data.
V) Start with cross-validation and, inside each iteration, perform sampling on the 9 training folds and then perform feature selection on that sampled data.

Which of these approaches is correct and also gives good results?


Answer 1:


According to the SMOTE paper, feature selection should be performed before sampling.
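To make the "feature selection before sampling" ordering concrete, here is a minimal sketch of option IV from the question, where both steps run inside each cross-validation iteration and the held-out fold is never resampled. It is an illustration only, not code from the SMOTE paper or from this answer. Everything in it is an assumption made for the example: the toy data, the helper names (`make_toy_data`, `select_features`, `oversample`, `stratified_folds`, `cross_validate`), the mean-difference feature score, the nearest-centroid classifier, and the oversampler, which interpolates between random same-class pairs as a simplified stand-in for SMOTE (real SMOTE interpolates toward k-nearest neighbours).

```python
import random


def make_toy_data(seed=0, n_major=30, n_minor=20, n_features=6):
    """Hypothetical imbalanced data: only features 0 and 1 carry class signal."""
    rng = random.Random(seed)
    X, y = [], []
    for label, n in ((0, n_major), (1, n_minor)):
        for _ in range(n):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
            row[0] += 2.0 * label
            row[1] -= 2.0 * label
            X.append(row)
            y.append(label)
    return X, y


def select_features(X, y, k):
    """Return indices of the k features with the largest class-mean gap."""
    classes = sorted(set(y))
    scores = []
    for j in range(len(X[0])):
        means = []
        for c in classes:
            vals = [row[j] for row, label in zip(X, y) if label == c]
            means.append(sum(vals) / len(vals))
        scores.append(abs(means[0] - means[-1]))
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]


def oversample(X, y, rng):
    """Balance classes with synthetic rows (simplified SMOTE stand-in)."""
    by_class = {}
    for row, label in zip(X, y):
        by_class.setdefault(label, []).append(row)
    target = max(len(rows) for rows in by_class.values())
    X_out, y_out = list(X), list(y)
    for label, rows in by_class.items():
        for _ in range(target - len(rows)):
            a, b = rng.choice(rows), rng.choice(rows)
            t = rng.random()
            X_out.append([ai + t * (bi - ai) for ai, bi in zip(a, b)])
            y_out.append(label)
    return X_out, y_out


def stratified_folds(y, n_folds, rng):
    """Split row indices into folds that keep the class ratio roughly equal."""
    by_class = {}
    for i, label in enumerate(y):
        by_class.setdefault(label, []).append(i)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, i in enumerate(indices):
            folds[pos % n_folds].append(i)
    return folds


def nearest_centroid_predict(X_train, y_train, X_test):
    """Predict the label whose training centroid is closest to each test row."""
    centroids = {}
    for label in set(y_train):
        rows = [r for r, l in zip(X_train, y_train) if l == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    preds = []
    for row in X_test:
        dist = {c: sum((a - b) ** 2 for a, b in zip(row, centroids[c]))
                for c in centroids}
        preds.append(min(dist, key=dist.get))
    return preds


def cross_validate(X, y, n_folds=5, k=2, seed=0):
    """Per fold: select features on train rows, oversample train rows, score on untouched test rows."""
    rng = random.Random(seed)
    accuracies = []
    for test_idx in stratified_folds(y, n_folds, rng):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        X_tr = [X[i] for i in train_idx]
        y_tr = [y[i] for i in train_idx]
        # Step 1: feature selection sees the training rows only.
        cols = select_features(X_tr, y_tr, k)
        X_tr = [[row[j] for j in cols] for row in X_tr]
        X_te = [[X[i][j] for j in cols] for i in test_idx]
        y_te = [y[i] for i in test_idx]
        # Step 2: oversampling is applied to the training rows only.
        X_bal, y_bal = oversample(X_tr, y_tr, rng)
        preds = nearest_centroid_predict(X_bal, y_bal, X_te)
        accuracies.append(sum(p == t for p, t in zip(preds, y_te)) / len(y_te))
    return accuracies


if __name__ == "__main__":
    X, y = make_toy_data(seed=1)
    print(cross_validate(X, y, n_folds=5, k=2, seed=0))
```

The point of the structure is that synthetic rows never end up in a fold used for scoring, and the feature ranking never sees the held-out rows. With real data, the same ordering can likely be expressed with `imblearn.pipeline.Pipeline`, placing a scikit-learn selector before `imblearn.over_sampling.SMOTE` and passing the pipeline to a cross-validation routine.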



Source: https://stackoverflow.com/questions/63375860/sampling-before-or-after-feature-selection
