feature-selection

R: using my own model in RFE (recursive feature elimination) to pick important features

Submitted by 落花浮王杯 on 2019-12-13 21:01:13
Question: Using RFE you can get an importance ranking of the features, but right now I can only use the models and parameters built into the package, such as lmFuncs (linear model) and rfFuncs (random forest). It seems that caretFuncs allows custom settings for your own model and parameters, but I don't know the details and the official documentation doesn't give them. I want to apply SVM and GBM to this RFE process, because these are the models I currently train with. Does anyone have any idea? Answer 1: I tried to recreate working
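The question is about R's caret, but the same pattern can be sketched in scikit-learn, where RFE accepts any estimator exposing `coef_` or `feature_importances_`; a linear SVM can therefore drive the elimination directly. The dataset here is synthetic and purely illustrative.

```python
# Hedged sketch: RFE driven by a linear SVM (scikit-learn analogue of
# plugging a custom SVM into caret's rfe via caretFuncs).
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, random_state=0)

# LinearSVC exposes coef_, so RFE can rank features by weight magnitude.
rfe = RFE(estimator=LinearSVC(dual=False, max_iter=5000),
          n_features_to_select=3).fit(X, y)
print(rfe.ranking_)   # rank 1 marks the selected features
```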

Optimizing the number of features

Submitted by 孤街醉人 on 2019-12-13 10:23:47
Question: I am training a neural network using Keras. Every time I train my model, I use a slightly different set of features selected with tree-based feature selection via ExtraTreesClassifier(). After every training run I compute the ROC AUC on my validation set and then go back in a loop to train the model again with a different set of features. This process is very inefficient, and I want to select the optimum number of features using some optimization technique available in a Python library. The
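One standard way to replace the manual retrain-and-check loop is cross-validated recursive elimination: RFECV searches over the number of features and scores each candidate subset with ROC AUC in a single fit. A hedged sketch on synthetic data (not the asker's setup):

```python
# RFECV picks the cross-validated optimum number of features automatically.
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import RFECV

X, y = make_classification(n_samples=300, n_features=12,
                           n_informative=4, random_state=0)

selector = RFECV(ExtraTreesClassifier(n_estimators=50, random_state=0),
                 step=1, cv=3, scoring='roc_auc').fit(X, y)
print(selector.n_features_)  # cross-validated optimum number of features
```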

Python: feature selection in scikit-learn for a normal distribution

Submitted by 不打扰是莪最后的温柔 on 2019-12-12 16:05:14
Question: I have a pandas DataFrame whose index is unique user identifiers, whose columns correspond to unique events, and whose values are 1 (attended), 0 (did not attend), or NaN (wasn't invited / not relevant). The matrix is pretty sparse with respect to NaNs: there are several hundred events and most users were invited to at most a few tens of them. I created some extra columns to measure "success", which I define as the % attended relative to invites: my_data['invited'] = my_data.count(axis=1) my_data[
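The "% attended relative to invites" bookkeeping described above can be sketched as follows; the DataFrame, column names, and values are invented for illustration.

```python
# Minimal sketch: invites = non-NaN cells per row, success = attended/invited.
import numpy as np
import pandas as pd

my_data = pd.DataFrame(
    {"event_a": [1, 0, np.nan], "event_b": [1, 1, 0], "event_c": [np.nan, 0, 1]},
    index=["user1", "user2", "user3"])

my_data["invited"] = my_data.count(axis=1)          # non-NaN cells = invites
my_data["attended"] = my_data[["event_a", "event_b", "event_c"]].sum(axis=1)
my_data["success"] = my_data["attended"] / my_data["invited"]
print(my_data["success"])
```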

Feature selection on a keras model

Submitted by 痞子三分冷 on 2019-12-12 12:33:00
Question: I was trying to find the features that dominate the output of my regression model. The following is my code: seed = 7 np.random.seed(seed) estimators = [] estimators.append(('mlp', KerasRegressor(build_fn=baseline_model, epochs=3, batch_size=20))) pipeline = Pipeline(estimators) rfe = RFE(estimator=pipeline, n_features_to_select=5) fit = rfe.fit(X_set, Y_set) But I get the following runtime error when running: RuntimeError: The classifier does not expose "coef_" or "feature_importances_
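The error arises because a Keras pipeline exposes neither `coef_` nor `feature_importances_`, which RFE needs for its ranking. One hedged workaround (an assumption, not the asker's method) is to run RFE with a surrogate estimator that does expose `coef_`, here plain linear regression, and pass only the surviving columns to the network afterwards.

```python
# Surrogate-estimator RFE: rank and prune features with LinearRegression,
# then feed the reduced matrix to the Keras model.
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

X_set, Y_set = make_regression(n_samples=200, n_features=10,
                               n_informative=5, random_state=7)

rfe = RFE(estimator=LinearRegression(), n_features_to_select=5).fit(X_set, Y_set)
X_reduced = rfe.transform(X_set)     # feed this to the Keras model instead
print(X_reduced.shape)               # -> (200, 5)
```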

Mapping the index of the feature importances to the index of columns in a DataFrame

Submitted by 匆匆过客 on 2019-12-12 04:25:18
Question: Hello, I plotted a graph using feature_importance from xgboost. However, the graph labels the features with "f-values", so I do not know which feature is represented in the graph. One way I have heard to solve this is to map the index of the features in my DataFrame to the index of the feature_importance "f-values" and select the columns manually. How do I go about doing this? Also, if there is another way of doing this, help would truly be appreciated. Here is my code below: feature
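The mapping idea can be sketched without xgboost itself: the default feature names are "f0", "f1", ... in column order, so a dict from those labels back to the DataFrame's columns resolves the plot labels. The DataFrame and scores below are invented for illustration.

```python
# Translate xgboost's "f0"/"f1"/... labels back to real column names.
import pandas as pd

df = pd.DataFrame(columns=["age", "income", "tenure"])
f_scores = {"f0": 12, "f2": 30}          # e.g. booster.get_fscore() output

name_map = {f"f{i}": col for i, col in enumerate(df.columns)}
readable = {name_map[k]: v for k, v in f_scores.items()}
print(readable)   # -> {'age': 12, 'tenure': 30}
```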

FeatureUnion in scikit-learn and an incompatible row dimension

Submitted by ﹥>﹥吖頭↗ on 2019-12-12 03:39:02
Question: I have started to use scikit-learn for text extraction. When I use the standard CountVectorizer and TfidfTransformer in a pipeline and then try to combine them with new features (a concatenation of matrices), I get a row dimension problem. This is my pipeline: pipeline = Pipeline([('feats', FeatureUnion([('ngram_tfidf', Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer())])), ('addned', AddNed())])), ('clf', SGDClassifier())]) This is my class AddNed, which adds 30 new
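The shape contract FeatureUnion enforces can be sketched as follows: every branch's `transform()` must return exactly one row per input sample, or the horizontal stack fails with a dimension mismatch. The `AddNed` below is a hypothetical stand-in for the asker's custom transformer, returning two invented numeric features per document.

```python
# A custom transformer whose output row count matches the input sample count,
# so FeatureUnion can hstack it with the vectorizer's output.
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion

class AddNed(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        # one row per document, 2 extra numeric features
        return np.array([[len(doc), doc.count(" ")] for doc in X])

feats = FeatureUnion([("counts", CountVectorizer()), ("addned", AddNed())])
out = feats.fit_transform(["a short doc", "another slightly longer doc"])
print(out.shape)   # rows match the number of input documents
```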

Get predictions on test sets in MLR

Submitted by 天大地大妈咪最大 on 2019-12-12 01:24:40
Question: I'm fitting classification models for binary problems using the mlr package in R. For each model, I perform cross-validation with embedded feature selection using the "selectFeatures" function and retrieve the mean AUC over the test sets. I would next like to retrieve the predictions on the test sets for each fold, but this function does not seem to support that. I already tried plugging the selected predictors into the "resample" function to get them. It works, but the performance metrics are different, which is not
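This question is about R's mlr, but the underlying idea — collecting the out-of-fold (test-set) predictions made during cross-validation — can be sketched in Python, where cross_val_predict returns exactly one held-out prediction per sample. This is an analogue for illustration, not mlr's API.

```python
# Out-of-fold predictions: each sample is predicted by the fold that held it out.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=120, n_features=6, random_state=0)
oof_pred = cross_val_predict(LogisticRegression(max_iter=1000), X, y, cv=5)
print(oof_pred.shape)   # -> (120,)
```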

How to remove particular attributes from arff file and produce modified arff?

Submitted by 被刻印的时光 ゝ on 2019-12-11 13:28:48
Question: (Not manually.) I have 96 features and want to remove some 20 of them from an ARFF file and produce a modified ARFF. I used Weka for feature selection and now want to remove those less important features. Can anyone suggest code for this? Answer 1: Here you go... just change the source and destination file paths... import java.io.File; import weka.core.Instances; import weka.core.converters.ArffLoader; import weka.core.converters.ArffSaver; import weka.filters.Filter; import weka.filters.unsupervised.attribute.Remove;

Getting names and number of selected features before giving to a classifier in sklearn pipeline

Submitted by 自闭症网瘾萝莉.ら on 2019-12-11 06:35:08
Question: I am using sel = SelectFromModel(ExtraTreesClassifier(10), threshold='mean') to select the most important features in my data set. Then I want to feed these selected features to my Keras classifier. But my Keras-based neural network classifier needs the number of important features selected in the first step. Below is the code for my Keras classifier; the variable X_new is the NumPy array of the newly selected features. The code for the Keras classifier is as under: def create_model( dropout=0.2): n
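The count the network needs is just the column count of the transformed array (equivalently `sel.get_support().sum()`), and `get_support()` also recovers which original features survived. A hedged sketch on synthetic data:

```python
# After SelectFromModel, X_new.shape[1] is the input dimension for the network.
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

X, y = make_classification(n_samples=200, n_features=15,
                           n_informative=5, random_state=0)

sel = SelectFromModel(ExtraTreesClassifier(10, random_state=0),
                      threshold="mean").fit(X, y)
X_new = sel.transform(X)
n_features = X_new.shape[1]          # pass this to the Keras input layer
print(n_features, sel.get_support().sum())
```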

How to Perform Consecutive Counts of Column by Group Conditionally Upon Another Column

Submitted by 风格不统一 on 2019-12-11 06:08:11
Question: I'm trying to get consecutive counts from the Noshow column grouped by the PatientID column. The code below that I am using is very close to the results I wish to attain. However, using the sum function returns the sum of the whole group. I would like the sum function to only sum the current row and the rows that have a '1' above it. Basically, I'm trying to count the consecutive number of times a patient no-shows their appointment for each row, and then reset to 0 when they do show.
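One standard way to get within-group consecutive counts is to build a block id that increments whenever Noshow changes within a patient, cumcount within each block, and zero out the "showed" rows. The data below is invented for illustration.

```python
# Consecutive no-show streaks per patient, resetting to 0 on a show.
import pandas as pd

df = pd.DataFrame({
    "PatientID": [1, 1, 1, 1, 2, 2, 2],
    "Noshow":    [1, 1, 0, 1, 0, 1, 1],
})

grp = df.groupby("PatientID")["Noshow"]
block = grp.transform(lambda s: (s != s.shift()).cumsum())   # run id per streak
df["consec"] = df.groupby(["PatientID", block]).cumcount() + 1
df.loc[df["Noshow"] == 0, "consec"] = 0
print(df["consec"].tolist())   # -> [1, 2, 0, 1, 0, 1, 2]
```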