feature-selection

Recursive feature elimination and grid search using scikit-learn

Submitted by 夙愿已清 on 2019-11-28 19:41:11
I would like to perform recursive feature elimination with nested grid search and cross-validation for each feature subset using scikit-learn. From the RFECV documentation it sounds like this type of operation is supported using the estimator_params parameter:

    estimator_params : dict
        Parameters for the external estimator. Useful for doing grid searches.

However, when I try to pass a grid of hyperparameters to the RFECV object:

    from sklearn.datasets import make_friedman1
    from sklearn.feature_selection import RFECV
    from sklearn.svm import SVR
    X, y = make_friedman1(n_samples=50, n_features=10,
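The estimator_params argument was later deprecated and removed from scikit-learn. A minimal sketch of the approach that replaced it: wrap RFECV inside GridSearchCV and reach the inner estimator's hyperparameters through the "estimator__" prefix. The data and SVR setup follow the question; the C grid values are illustrative.

```python
from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)

# Tune the inner SVR's C through RFECV's nested-parameter syntax
# ("estimator__C") instead of the removed estimator_params argument.
search = GridSearchCV(
    estimator=RFECV(SVR(kernel="linear"), cv=3),
    param_grid={"estimator__C": [0.1, 1, 10]},
    cv=3,
)
search.fit(X, y)

# The refit RFECV reports how many features it kept for the best C.
print(search.best_params_, search.best_estimator_.n_features_)
```

Note that this runs feature elimination inside every grid-search fold, so each candidate C is evaluated together with its own feature subset.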

Random Forest Feature Importance Chart using Python

Submitted by £可爱£侵袭症+ on 2019-11-28 16:22:19
Question: I am working with RandomForestRegressor in Python and I want to create a chart that illustrates the ranking of feature importances. This is the code I used:

    from sklearn.ensemble import RandomForestRegressor
    MT = pd.read_csv("MT_reduced.csv")
    df = MT.reset_index(drop=False)
    columns2 = df.columns.tolist()
    # Filter the columns to remove ones we don't want.
    columns2 = [c for c in columns2 if c not in ["Violent_crime_rate", "Change_Property_crime_rate", "State", "Year"]]
    # Store the variable we
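A common way to finish this off is a sorted bar chart of feature_importances_. A minimal sketch, using synthetic regression data as a stand-in for MT_reduced.csv (which is not available here); the feature names are invented:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this also runs headless
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the question's CSV data.
X, y = make_regression(n_samples=100, n_features=6, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Sort features by importance so the chart reads from most to least important.
order = np.argsort(model.feature_importances_)[::-1]
plt.figure()
plt.bar(range(len(order)), model.feature_importances_[order])
plt.xticks(range(len(order)), [feature_names[i] for i in order], rotation=90)
plt.ylabel("Importance")
plt.tight_layout()
plt.savefig("importances.png")
```

With a real dataframe, feature_names would come from the filtered column list rather than being generated.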

The easiest way for getting feature names after running SelectKBest in Scikit Learn

Submitted by 只愿长相守 on 2019-11-28 08:03:22
I would like to do supervised learning. So far I know how to run supervised learning on all features, but I would also like to experiment with the K best features. I read the documentation and found that scikit-learn has a SelectKBest method. Unfortunately, I am not sure how to create a new dataframe after finding those best features. Let's assume I would like to experiment with the 5 best features:

    from sklearn.feature_selection import SelectKBest, f_classif
    select_k_best_classifier = SelectKBest(score_func=f_classif, k=5).fit_transform(features_dataframe, targeted_class)
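The usual answer is to fit the selector (rather than fit_transform straight to an array) and use get_support() to map back to column names. A minimal sketch on the iris dataset as a stand-in; features_dataframe and targeted_class mirror the question's names:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

# Stand-in data with named columns, like the question's dataframe.
iris = load_iris(as_frame=True)
features_dataframe, targeted_class = iris.data, iris.target

selector = SelectKBest(score_func=f_classif, k=2).fit(features_dataframe, targeted_class)

# get_support() is a boolean mask over the original columns, so the
# selected names index straight back into the dataframe.
selected_columns = features_dataframe.columns[selector.get_support()]
new_dataframe = features_dataframe[selected_columns]
print(list(selected_columns))
```

Here k=2 keeps the example small; k=5 works the same way on a dataframe with at least 5 columns.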

Should Feature Selection be done before Train-Test Split or after?

Submitted by ☆樱花仙子☆ on 2019-11-28 07:08:15
Question: There is an apparent contradiction between two facts, each a possible answer to the question: The conventional answer is to do feature selection after splitting, because doing it before can leak information from the test set. The contradicting answer is that, if only the training set chosen from the whole dataset is used for feature selection, the feature selection or feature importance score ordering is likely to change with the random_state of the train_test_split. And if
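The leakage-free order of operations from the conventional answer can be sketched as follows: split first, fit the selector on the training split only, then apply the same fitted transform to the test split (synthetic data for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit the selector on the training split only; the test set never
# influences which features are chosen.
selector = SelectKBest(f_classif, k=5).fit(X_train, y_train)
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)
```

The random_state sensitivity the second answer describes is real, but it reflects variance in the selection procedure itself; the usual remedy is cross-validated selection inside the training data, not selecting on the full dataset.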

Linear regression analysis with string/categorical features (variables)?

Submitted by ⅰ亾dé卋堺 on 2019-11-28 03:23:58
Regression algorithms seem to work on features represented as numbers. For example, this dataset contains no categorical features/variables, and it is quite clear how to run a regression on it and predict the price. But now I want to do regression analysis on data that contains categorical features. There are 5 features: District, Condition, Material, Security, Type. How can I do regression on this data? Do I have to transform all this string/categorical data to numbers manually? I mean, do I have to create some encoding rules and, according to those rules, transform all the data to numeric
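The standard answer is one-hot encoding rather than hand-written encoding rules. A toy sketch with pandas; the column names echo two of the question's five features, and the values are invented:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Invented housing-style rows; only the column names follow the question.
df = pd.DataFrame({
    "District": ["Center", "North", "Center", "South"],
    "Condition": ["good", "bad", "good", "good"],
    "Price": [300, 120, 280, 150],
})

# One-hot encode the string columns so linear regression sees only numbers:
# one indicator column per category value.
X = pd.get_dummies(df[["District", "Condition"]])
model = LinearRegression().fit(X, df["Price"])
print(X.columns.tolist())
```

scikit-learn's OneHotEncoder does the same job inside a Pipeline, which is preferable when the encoding must be reapplied consistently to new data.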

How can I avoid using estimator_params when using RFECV nested within GridSearchCV?

Submitted by 本秂侑毒 on 2019-11-27 06:26:52
Question: I'm currently working on recursive feature elimination (RFECV) within a grid search (GridSearchCV) for tree-based methods using scikit-learn. To do this, I'm using the current dev version on GitHub (0.17), which allows RFECV to use feature importances from the tree methods to select features to discard. For clarity, this means: loop over the hyperparameters for the current tree method; for each set of parameters, perform recursive feature elimination to obtain the optimal number of features; report the
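The loop described above maps directly onto nesting RFECV inside GridSearchCV, with the tree's hyperparameters reached through the "estimator__" prefix instead of estimator_params. A minimal sketch on synthetic data; the max_depth grid is illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=120, n_features=8, n_informative=3,
                           random_state=0)

# For every max_depth candidate, GridSearchCV fits an RFECV, which itself
# uses the forest's feature importances to discard features recursively.
search = GridSearchCV(
    RFECV(RandomForestClassifier(n_estimators=20, random_state=0),
          step=2, cv=3),
    param_grid={"estimator__max_depth": [2, 4]},
    cv=3,
)
search.fit(X, y)
print(search.best_params_, search.best_estimator_.n_features_)
```

This realizes exactly the "for each set of parameters, run RFE" loop, with scikit-learn handling the cross-validation bookkeeping.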

How are feature_importances in RandomForestClassifier determined?

Submitted by 人走茶凉 on 2019-11-27 02:30:59
I have a classification task with a time series as the data input, where each attribute (n=23) represents a specific point in time. Besides the absolute classification result, I would like to find out which attributes/dates contribute to the result, and to what extent. Therefore I am simply using feature_importances_, which works well for me. However, I would like to know how they are calculated and which measure/algorithm is used. Unfortunately I could not find any documentation on this topic. There are indeed several ways to get feature "importances". As often, there is no strict
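For random forests, the default feature_importances_ is the impurity-based ("mean decrease in impurity", or Gini) importance: each tree's normalized importances are averaged over the forest and renormalized. A sketch that reconstructs the forest value from the per-tree values, on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=150, n_features=5, n_informative=2,
                           random_state=0)
forest = RandomForestClassifier(n_estimators=30, random_state=0).fit(X, y)

# Each tree exposes its own impurity-based importances; the forest's value
# is their mean across trees, normalized to sum to 1.
per_tree = np.array([t.feature_importances_ for t in forest.estimators_])
manual = per_tree.mean(axis=0)
manual /= manual.sum()
print(np.allclose(manual, forest.feature_importances_))
```

Permutation importance (shuffling one feature and measuring the score drop) is the main alternative, and is less biased toward high-cardinality features than the impurity-based measure.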
