feature-selection

R: using my own model in RFE (recursive feature elimination) to pick important features

Submitted by 落花浮王杯 on 2019-12-13 21:01:13
Question: Using RFE you can get an importance ranking of the features, but right now I can only use the models and parameters built into the package, such as lmFuncs (linear model) and rfFuncs (random forest). It seems that caretFuncs allows custom settings for your own model and parameters, but I don't know the details and the official documentation doesn't give them. I want to apply SVM and GBM to this RFE process, because these are the models I currently train with. Does anyone have any idea? Answer 1: I tried to recreate working
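The question is about R's caret, but the same pattern can be sketched in scikit-learn, where RFE accepts any estimator exposing `coef_` or `feature_importances_`; a linear SVM can therefore drive the elimination directly. The dataset here is synthetic and purely illustrative.

```python
# Hedged sketch: RFE driven by a linear SVM (scikit-learn analogue of
# plugging a custom SVM into caret's rfe via caretFuncs).
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, random_state=0)

# LinearSVC exposes coef_, so RFE can rank features by weight magnitude.
rfe = RFE(estimator=LinearSVC(dual=False, max_iter=5000),
          n_features_to_select=3).fit(X, y)
print(rfe.ranking_)   # rank 1 marks the selected features
```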

Optimizing the number of features

Submitted by 孤街醉人 on 2019-12-13 10:23:47
Question: I am training a neural network using Keras. Every time I train my model, I use a slightly different set of features selected with tree-based feature selection via ExtraTreesClassifier(). After every training run I compute the ROC AUC on my validation set and then go back in a loop to train the model again with a different set of features. This process is very inefficient, and I want to select the optimum number of features using some optimization technique available in a Python library. The
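One standard way to replace the manual retrain-and-check loop is cross-validated recursive elimination: RFECV searches over the number of features and scores each candidate subset with ROC AUC in a single fit. A hedged sketch on synthetic data (not the asker's setup):

```python
# RFECV picks the cross-validated optimum number of features automatically.
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import RFECV

X, y = make_classification(n_samples=300, n_features=12,
                           n_informative=4, random_state=0)

selector = RFECV(ExtraTreesClassifier(n_estimators=50, random_state=0),
                 step=1, cv=3, scoring='roc_auc').fit(X, y)
print(selector.n_features_)  # cross-validated optimum number of features
```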

Python: feature selection in scikit-learn for a normal distribution

Submitted by 不打扰是莪最后的温柔 on 2019-12-12 16:05:14
Question: I have a pandas DataFrame whose index is unique user identifiers, whose columns correspond to unique events, and whose values are 1 (attended), 0 (did not attend), or NaN (wasn't invited / not relevant). The matrix is pretty sparse with respect to NaNs: there are several hundred events and most users were invited to at most a few tens of them. I created some extra columns to measure "success", which I define as the % attended relative to invites: my_data['invited'] = my_data.count(axis=1) my_data[
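The "% attended relative to invites" bookkeeping described above can be sketched as follows; the DataFrame, column names, and values are invented for illustration.

```python
# Minimal sketch: invites = non-NaN cells per row, success = attended/invited.
import numpy as np
import pandas as pd

my_data = pd.DataFrame(
    {"event_a": [1, 0, np.nan], "event_b": [1, 1, 0], "event_c": [np.nan, 0, 1]},
    index=["user1", "user2", "user3"])

my_data["invited"] = my_data.count(axis=1)          # non-NaN cells = invites
my_data["attended"] = my_data[["event_a", "event_b", "event_c"]].sum(axis=1)
my_data["success"] = my_data["attended"] / my_data["invited"]
print(my_data["success"])
```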

Feature selection on a keras model

Submitted by 痞子三分冷 on 2019-12-12 12:33:00
Question: I was trying to find the features that dominate the output of my regression model. The following is my code: seed = 7 np.random.seed(seed) estimators = [] estimators.append(('mlp', KerasRegressor(build_fn=baseline_model, epochs=3, batch_size=20))) pipeline = Pipeline(estimators) rfe = RFE(estimator=pipeline, n_features_to_select=5) fit = rfe.fit(X_set, Y_set) But I get the following runtime error when running: RuntimeError: The classifier does not expose "coef_" or "feature_importances_
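The error arises because a Keras pipeline exposes neither `coef_` nor `feature_importances_`, which RFE needs for its ranking. One hedged workaround (an assumption, not the asker's method) is to run RFE with a surrogate estimator that does expose `coef_`, here plain linear regression, and pass only the surviving columns to the network afterwards.

```python
# Surrogate-estimator RFE: rank and prune features with LinearRegression,
# then feed the reduced matrix to the Keras model.
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

X_set, Y_set = make_regression(n_samples=200, n_features=10,
                               n_informative=5, random_state=7)

rfe = RFE(estimator=LinearRegression(), n_features_to_select=5).fit(X_set, Y_set)
X_reduced = rfe.transform(X_set)     # feed this to the Keras model instead
print(X_reduced.shape)               # -> (200, 5)
```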

Mapping the index of the feature importances to the index of columns in a DataFrame

Submitted by 匆匆过客 on 2019-12-12 04:25:18
Question: Hello, I plotted a graph using feature_importance from xgboost. However, the graph labels the features with "f-values", so I do not know which feature is represented in the graph. One way I have heard to solve this is to map the index of the features in my DataFrame to the index of the feature_importance "f-values" and select the columns manually. How do I go about doing this? Also, if there is another way of doing this, help would truly be appreciated. Here is my code below: feature
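The mapping idea can be sketched without xgboost itself: the default feature names are "f0", "f1", ... in column order, so a dict from those labels back to the DataFrame's columns resolves the plot labels. The DataFrame and scores below are invented for illustration.

```python
# Translate xgboost's "f0"/"f1"/... labels back to real column names.
import pandas as pd

df = pd.DataFrame(columns=["age", "income", "tenure"])
f_scores = {"f0": 12, "f2": 30}          # e.g. booster.get_fscore() output

name_map = {f"f{i}": col for i, col in enumerate(df.columns)}
readable = {name_map[k]: v for k, v in f_scores.items()}
print(readable)   # -> {'age': 12, 'tenure': 30}
```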

FeatureUnion in scikit-learn and an incompatible row dimension

Submitted by ﹥>﹥吖頭↗ on 2019-12-12 03:39:02
Question: I have started to use scikit-learn for text extraction. When I use the standard CountVectorizer and TfidfTransformer in a pipeline and then try to combine them with new features (a concatenation of matrices), I get a row dimension problem. This is my pipeline: pipeline = Pipeline([('feats', FeatureUnion([('ngram_tfidf', Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer())])), ('addned', AddNed())])), ('clf', SGDClassifier())]) This is my class AddNed, which adds 30 new
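The shape contract FeatureUnion enforces can be sketched as follows: every branch's `transform()` must return exactly one row per input sample, or the horizontal stack fails with a dimension mismatch. The `AddNed` below is a hypothetical stand-in for the asker's custom transformer, returning two invented numeric features per document.

```python
# A custom transformer whose output row count matches the input sample count,
# so FeatureUnion can hstack it with the vectorizer's output.
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion

class AddNed(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        # one row per document, 2 extra numeric features
        return np.array([[len(doc), doc.count(" ")] for doc in X])

feats = FeatureUnion([("counts", CountVectorizer()), ("addned", AddNed())])
out = feats.fit_transform(["a short doc", "another slightly longer doc"])
print(out.shape)   # rows match the number of input documents
```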

Get predictions on test sets in MLR

Submitted by 天大地大妈咪最大 on 2019-12-12 01:24:40
Question: I'm fitting classification models for binary problems using the mlr package in R. For each model, I perform cross-validation with embedded feature selection using the "selectFeatures" function and retrieve the mean AUC over the test sets. I would next like to retrieve the predictions on the test sets for each fold, but this function does not seem to support that. I already tried plugging the selected predictors into the "resample" function to get them. It works, but the performance metrics are different, which is not
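This question is about R's mlr, but the underlying idea — collecting the out-of-fold (test-set) predictions made during cross-validation — can be sketched in Python, where cross_val_predict returns exactly one held-out prediction per sample. This is an analogue for illustration, not mlr's API.

```python
# Out-of-fold predictions: each sample is predicted by the fold that held it out.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=120, n_features=6, random_state=0)
oof_pred = cross_val_predict(LogisticRegression(max_iter=1000), X, y, cv=5)
print(oof_pred.shape)   # -> (120,)
```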

How to remove particular attributes from arff file and produce modified arff?

Submitted by 被刻印的时光 ゝ on 2019-12-11 13:28:48
Question: (Not manually.) I have 96 features and want to remove some 20 of them from an ARFF file and produce a modified ARFF. I used Weka for feature selection and now want to remove those less important features. Can anyone suggest code for this? Answer 1: Here you go... just change the source and destination file paths... import java.io.File; import weka.core.Instances; import weka.core.converters.ArffLoader; import weka.core.converters.ArffSaver; import weka.filters.Filter; import weka.filters.unsupervised.attribute.Remove;

Getting names and number of selected features before giving to a classifier in sklearn pipeline

Submitted by 自闭症网瘾萝莉.ら on 2019-12-11 06:35:08
Question: I am using sel = SelectFromModel(ExtraTreesClassifier(10), threshold='mean') to select the most important features in my data set. Then I want to feed these selected features to my Keras classifier. But my Keras-based neural network classifier needs the number of important features selected in the first step. Below is the code for my Keras classifier; the variable X_new is the NumPy array of the newly selected features. The code for the Keras classifier is as under: def create_model( dropout=0.2): n
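The count the network needs is just the column count of the transformed array (equivalently `sel.get_support().sum()`), and `get_support()` also recovers which original features survived. A hedged sketch on synthetic data:

```python
# After SelectFromModel, X_new.shape[1] is the input dimension for the network.
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

X, y = make_classification(n_samples=200, n_features=15,
                           n_informative=5, random_state=0)

sel = SelectFromModel(ExtraTreesClassifier(10, random_state=0),
                      threshold="mean").fit(X, y)
X_new = sel.transform(X)
n_features = X_new.shape[1]          # pass this to the Keras input layer
print(n_features, sel.get_support().sum())
```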

How to Perform Consecutive Counts of Column by Group Conditionally Upon Another Column

Submitted by 风格不统一 on 2019-12-11 06:08:11
Question: I'm trying to get consecutive counts from the Noshow column grouped by the PatientID column. The code below that I am using is very close to the results I wish to attain. However, using the sum function returns the sum of the whole group. I would like the sum function to only sum the current row and the rows that have a '1' above it. Basically, I'm trying to count the consecutive number of times a patient no-shows their appointment for each row, and then reset to 0 when they do show.
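One standard way to get within-group consecutive counts is to build a block id that increments whenever Noshow changes within a patient, cumcount within each block, and zero out the "showed" rows. The data below is invented for illustration.

```python
# Consecutive no-show streaks per patient, resetting to 0 on a show.
import pandas as pd

df = pd.DataFrame({
    "PatientID": [1, 1, 1, 1, 2, 2, 2],
    "Noshow":    [1, 1, 0, 1, 0, 1, 1],
})

grp = df.groupby("PatientID")["Noshow"]
block = grp.transform(lambda s: (s != s.shift()).cumsum())   # run id per streak
df["consec"] = df.groupby(["PatientID", block]).cumcount() + 1
df.loc[df["Noshow"] == 0, "consec"] = 0
print(df["consec"].tolist())   # -> [1, 2, 0, 1, 0, 1, 2]
```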