feature-selection

Perform Chi-2 feature selection on TF and TF*IDF vectors

北城以北 submitted on 2019-12-02 18:33:11
I'm experimenting with chi-squared (Chi-2) feature selection for some text classification tasks. I understand that the chi-squared test checks the dependence between two categorical variables, so if we perform chi-squared feature selection for a binary text classification problem with a binary bag-of-words (BOW) vector representation, each test on each (feature, class) pair would be a very straightforward chi-squared test with 1 degree of freedom. Quoting from the documentation: http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html#sklearn.feature_selection.chi2 , This score can be used to select the n_features
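A minimal sketch of what this looks like in scikit-learn, with a made-up 4-document binary BOW matrix (the data and k=2 are purely illustrative, not from the question):

```python
# Hedged sketch: chi-squared scoring of binary BOW features for a binary
# label, via sklearn.feature_selection.chi2. The toy matrix is invented.
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

X = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 1, 1],
              [0, 1, 0]])   # 4 documents x 3 terms, binary BOW
y = np.array([1, 1, 0, 0])  # binary class labels

scores, pvals = chi2(X, y)            # one score per feature (df = 1 here)
selector = SelectKBest(chi2, k=2).fit(X, y)
X_selected = selector.transform(X)    # keep the 2 highest-scoring terms
```

The same call also accepts TF or TF*IDF matrices, since chi2 takes non-negative values and treats them like frequencies.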

find important features for classification

☆樱花仙子☆ submitted on 2019-12-02 18:21:58
I'm trying to classify some EEG data using a logistic regression model (this seems to give the best classification of my data). The data is from a multichannel EEG setup, so in essence I have a matrix of 63 x 116 x 50 (that is, channels x time points x number of trials; there are two trial types of 50 each), which I have reshaped into one long vector per trial. What I would like to do, after the classification, is see which features were the most useful in classifying the trials. How can I do that, and is it possible to test the significance of these features? e.g. to say that the
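One hedged way to inspect feature usefulness after fitting: reshape the coefficient vector back to (channel, time point) and look at the largest magnitudes. The shapes below are a small stand-in for the 63 x 116 x 50 setup, and the data is random; for actual significance you would still need something like a permutation test.

```python
# Sketch: map logistic-regression weights back onto (channel, time) positions.
# Dimensions and data are toy stand-ins, not the asker's real EEG matrix.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_channels, n_times, n_trials = 5, 8, 50   # stand-in for 63 x 116 x 50
X = rng.standard_normal((2 * n_trials, n_channels * n_times))  # one row per trial
y = np.repeat([0, 1], n_trials)            # two trial types, 50 each

clf = LogisticRegression(max_iter=1000).fit(X, y)
coefs = clf.coef_.reshape(n_channels, n_times)  # weight per (channel, time) feature
top = np.unravel_index(np.abs(coefs).argmax(), coefs.shape)  # most influential cell
```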

How to use scikit-learn PCA for features reduction and know which features are discarded

情到浓时终转凉″ submitted on 2019-12-02 16:18:42
I am trying to run a PCA on a matrix of dimensions m x n, where m is the number of features and n the number of samples. Suppose I want to preserve the nf features with the maximum variance. With scikit-learn I am able to do it in this way:

from sklearn.decomposition import PCA
nf = 100
pca = PCA(n_components=nf)
# X is the matrix transposed (n samples on the rows, m features on the columns)
pca.fit(X)
X_new = pca.transform(X)

Now I get a new matrix X_new with a shape of n x nf. Is it possible to know which features have been discarded and which were retained? Thanks

The features that your
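For context, PCA never discards individual input features: each retained component is a linear combination of all of them. A sketch (toy data) of how to see which original features dominate each component via the components_ attribute:

```python
# Sketch: inspect PCA loadings to see which original features carry the
# most weight in each retained component. Data here is random toy input.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))   # n = 50 samples, m = 10 features

pca = PCA(n_components=3).fit(X)
# components_ has shape (n_components, m): one loading per original feature.
loadings = np.abs(pca.components_)
most_important = loadings.argmax(axis=1)  # dominant feature index per component
```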

How to retrieve coefficient names after label encoding and one hot encoding on scikit-learn?

南楼画角 submitted on 2019-12-01 23:43:25
I am running a machine learning model (Ridge Regression w/ Cross-Validation) using scikit-learn's RidgeCV() method. My data set has 5 categorical features and 2 numerical ones, so I started with LabelEncoder() to convert the categorical features to integers, and then I applied OneHotEncoder() to make several new feature columns of 0s and 1s, in order to apply my Machine Learning model. My X_train is now a numpy array, and after fitting the model I am getting its coefficients, so I'm wondering -- is there a straightforward way to connect these coefficients back to the individual features they

Normalizing feature values for SVM

冷暖自知 submitted on 2019-12-01 16:54:24
I've been playing with some SVM implementations and I am wondering: what is the best way to normalize feature values to fit into one range (from 0 to 1)? Suppose I have 3 features with values in the ranges 3-5, 0.02-0.05, and 10-15. How do I convert all of those values into the range [0, 1]? What if, during training, the highest value of feature number 1 that I encounter is 5, and after I begin to use my model on much bigger datasets I stumble upon values as high as 7? Then in the converted range it would exceed 1... How do I normalize values during training to account for the
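A sketch with MinMaxScaler, using the three toy ranges from the question; the clip=True option (scikit-learn ≥ 0.24) pins out-of-range values like the hypothetical 7 to the edge of [0, 1] instead of letting them exceed it:

```python
# Sketch: fit min/max on training data, then clip unseen out-of-range
# values at transform time. Rows are invented samples within the ranges.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[3.0, 0.02, 10.0],
                    [5.0, 0.05, 15.0],
                    [4.0, 0.03, 12.0]])

scaler = MinMaxScaler(feature_range=(0, 1), clip=True).fit(X_train)
X_scaled = scaler.transform(X_train)   # every column now spans [0, 1]

# A later value above the training max (7 > 5) is clipped to 1:
X_new = scaler.transform(np.array([[7.0, 0.04, 11.0]]))
```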

Should Feature Selection be done before Train-Test Split or after?

被刻印的时光 ゝ submitted on 2019-12-01 14:24:56
Actually, there is a contradiction between two facts that are the possible answers to the question: The conventional answer is to do it after splitting, as there can be information leakage from the test set if it is done before. The contradicting answer is that, if only the training set chosen from the whole dataset is used for feature selection, then the feature selection or feature importance score ordering is likely to change with the random_state of the train_test_split. And if the feature selection for any particular work changes, then no generalization of feature importance can
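One standard way to reconcile the two facts is to put the selector inside a Pipeline and cross-validate the whole thing, so selection is refit on each training fold and the corresponding test fold never leaks into the feature scores. A sketch on synthetic data (SelectKBest with f_classif and k=5 are illustrative choices, not prescribed by the question):

```python
# Sketch: leakage-free feature selection by nesting it in the CV pipeline.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=5)),      # refit per training fold
    ("clf", LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipe, X, y, cv=5)        # selection never sees test folds
```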

feature selection using logistic regression

柔情痞子 submitted on 2019-12-01 13:20:40
I am performing feature selection (on a dataset with 1,930,388 rows and 88 features) using logistic regression. If I test the model on held-out data, the accuracy is just above 60%. The response variable is equally distributed. My question is: if the model's performance is not good, can I consider the features that it gives as actually important features? Or should I try to improve the accuracy of the model, though my end goal is not to improve the accuracy but only to get important features?

sklearn's GridSearchCV has some pretty neat methods to give you the best feature set. For example, consider
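Whatever the accuracy, coefficients are only comparable as importances if the features are on a common scale. A sketch (synthetic data standing in for the 88-feature set) that standardizes first and then ranks features by coefficient magnitude:

```python
# Sketch: standardize, fit logistic regression, rank features by |coef|.
# The dataset below is synthetic; shapes do not match the asker's data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=10,
                           n_informative=3, random_state=0)

pipe = make_pipeline(StandardScaler(),
                     LogisticRegression(max_iter=1000)).fit(X, y)
coefs = pipe[-1].coef_.ravel()            # one weight per (scaled) feature
ranking = np.argsort(-np.abs(coefs))      # feature indices, most important first
```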

Making the features of test data same as train data after featureselection in spark

流过昼夜 submitted on 2019-12-01 09:21:24
Question: I'm working in Scala. I have a big question: ChiSqSelector seems to reduce the dimension successfully, but I can't identify which features were removed and which were retained. How can I know which features were reduced?

[WrappedArray(a, b, c),(5,[1,2,3],[1,1,1]),(2,[0],[1])]
[WrappedArray(b, d, e),(5,[0,2,4],[1,1,2]),(2,[1],[2])]
[WrappedArray(a, c, d),(5,[0,1,3],[1,1,1]),(2,[0],[1])]

PS: when I wanted to make the test data the same as the feature-selected train data, I found that I don't know how to do that in
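In Spark ML, the fitted ChiSqSelectorModel should expose the retained column indices via its selectedFeatures field, and applying that same fitted model to the test DataFrame keeps train and test consistent (hedged: check the Spark version's API docs). The equivalent idea in scikit-learn terms, on toy data, since the same pattern answers both halves of the question:

```python
# Sketch: fit a chi-squared selector on train data, read out which column
# indices were kept, and reuse the SAME fitted selector on test data.
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

X_train = np.array([[1, 0, 1, 0, 0],
                    [1, 0, 1, 0, 2],
                    [1, 1, 0, 1, 0],
                    [0, 1, 0, 1, 1]])
y_train = np.array([0, 1, 0, 1])

sel = SelectKBest(chi2, k=2).fit(X_train, y_train)
kept = sel.get_support(indices=True)     # indices of the retained columns
X_test = np.array([[0, 1, 1, 1, 0]])
X_test_reduced = sel.transform(X_test)   # test data gets the same columns
```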

glmulti Oversized candidate set

时光毁灭记忆、已成空白 submitted on 2019-12-01 04:33:50
Error message: SYSTEM: win7/64bit/ultimate/16 GB real RAM plus virtual memory, memory.limit(32000). What does this error message mean? In glmulti(y = "y", data = mydf, xr = c("x1", : !Oversized candidate set. mydf has 3.6 million rows & 150 columns of floats. What steps can I take to work around it in glmulti? Are there any alternatives to glmulti in the R world? R/64bit "Good Sport"

I have encountered the same problem; here is what I have found out so far: The number of rows does not seem to be the issue. The issue is that with 150 predictors the package can't handle an exhaustive search (that is take a look and
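A quick count (plain Python arithmetic, just to make the answer's point concrete) shows why 150 predictors makes the candidate set "oversized" for an exhaustive search: the candidate set is every subset of predictors, and even capping models at 5 terms leaves hundreds of millions of candidates. This is presumably why glmulti also offers a genetic-algorithm search (method = "g") as the non-exhaustive alternative:

```python
# Sketch: size of the candidate model set for p predictors.
from math import comb

p = 150                       # number of predictors, as in the question
n_models = 2 ** p             # one candidate model per subset of predictors
n_small = sum(comb(p, k) for k in range(6))  # only models with <= 5 terms
```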