feature-selection

Perform Chi-2 feature selection on TF and TF*IDF vectors

北城以北 submitted on 2019-12-02 18:33:11
I'm experimenting with chi-squared (Chi-2) feature selection for some text classification tasks. I understand that the chi-squared test checks the dependence between two categorical variables, so if we perform chi-squared feature selection for a binary text classification problem with a binary bag-of-words (BOW) vector representation, each test on each (feature, class) pair would be a very straightforward chi-squared test with 1 degree of freedom. Quoting from the documentation: http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html#sklearn.feature_selection.chi2 , This score can be used to select the n_features
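A minimal sketch of what this looks like in scikit-learn, with a made-up 4-document binary BOW matrix (the data and k=2 are purely illustrative, not from the question):

```python
# Hedged sketch: chi-squared scoring of binary BOW features for a binary
# label, via sklearn.feature_selection.chi2. The toy matrix is invented.
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

X = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 1, 1],
              [0, 1, 0]])   # 4 documents x 3 terms, binary BOW
y = np.array([1, 1, 0, 0])  # binary class labels

scores, pvals = chi2(X, y)            # one score per feature (df = 1 here)
selector = SelectKBest(chi2, k=2).fit(X, y)
X_selected = selector.transform(X)    # keep the 2 highest-scoring terms
```

The same call also accepts TF or TF*IDF matrices, since chi2 takes non-negative values and treats them like frequencies.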

find important features for classification

☆樱花仙子☆ submitted on 2019-12-02 18:21:58
I'm trying to classify some EEG data using a logistic regression model (this seems to give the best classification of my data). The data is from a multichannel EEG setup, so in essence I have a matrix of 63 x 116 x 50 (that is, channels x time points x number of trials; there are two trial types of 50 each), which I have reshaped into one long vector per trial. What I would like to do, after the classification, is see which features were the most useful in classifying the trials. How can I do that, and is it possible to test the significance of these features? e.g. to say that the
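One hedged way to inspect feature usefulness after fitting: reshape the coefficient vector back to (channel, time point) and look at the largest magnitudes. The shapes below are a small stand-in for the 63 x 116 x 50 setup, and the data is random; for actual significance you would still need something like a permutation test.

```python
# Sketch: map logistic-regression weights back onto (channel, time) positions.
# Dimensions and data are toy stand-ins, not the asker's real EEG matrix.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_channels, n_times, n_trials = 5, 8, 50   # stand-in for 63 x 116 x 50
X = rng.standard_normal((2 * n_trials, n_channels * n_times))  # one row per trial
y = np.repeat([0, 1], n_trials)            # two trial types, 50 each

clf = LogisticRegression(max_iter=1000).fit(X, y)
coefs = clf.coef_.reshape(n_channels, n_times)  # weight per (channel, time) feature
top = np.unravel_index(np.abs(coefs).argmax(), coefs.shape)  # most influential cell
```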

How to use scikit-learn PCA for features reduction and know which features are discarded

情到浓时终转凉″ submitted on 2019-12-02 16:18:42
I am trying to run a PCA on a matrix of dimensions m x n, where m is the number of features and n the number of samples. Suppose I want to preserve the nf features with the maximum variance. With scikit-learn I am able to do it in this way:

from sklearn.decomposition import PCA
nf = 100
pca = PCA(n_components=nf)
# X is the matrix transposed (n samples on the rows, m features on the columns)
pca.fit(X)
X_new = pca.transform(X)

Now I get a new matrix X_new with a shape of n x nf. Is it possible to know which features have been discarded and which were retained? Thanks

The features that your
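For context, PCA never discards individual input features: each retained component is a linear combination of all of them. A sketch (toy data) of how to see which original features dominate each component via the components_ attribute:

```python
# Sketch: inspect PCA loadings to see which original features carry the
# most weight in each retained component. Data here is random toy input.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))   # n = 50 samples, m = 10 features

pca = PCA(n_components=3).fit(X)
# components_ has shape (n_components, m): one loading per original feature.
loadings = np.abs(pca.components_)
most_important = loadings.argmax(axis=1)  # dominant feature index per component
```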

How to retrieve coefficient names after label encoding and one hot encoding on scikit-learn?

南楼画角 submitted on 2019-12-01 23:43:25
I am running a machine learning model (Ridge Regression w/ Cross-Validation) using scikit-learn's RidgeCV() method. My data set has 5 categorical features and 2 numerical ones, so I started with LabelEncoder() to convert the categorical features to integers, and then I applied OneHotEncoder() to make several new feature columns of 0s and 1s, in order to apply my Machine Learning model. My X_train is now a numpy array, and after fitting the model I am getting its coefficients, so I'm wondering -- is there a straightforward way to connect these coefficients back to the individual features they

Normalizing feature values for SVM

冷暖自知 submitted on 2019-12-01 16:54:24
I've been playing with some SVM implementations and I am wondering: what is the best way to normalize feature values to fit into one range (from 0 to 1)? Suppose I have 3 features with values in the ranges 3-5, 0.02-0.05, and 10-15. How do I convert all of those values into the range [0, 1]? What if, during training, the highest value of feature number 1 that I encounter is 5, and after I begin to use my model on much bigger datasets I stumble upon values as high as 7? Then in the converted range it would exceed 1... How do I normalize values during training to account for the
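A sketch with MinMaxScaler, using the three toy ranges from the question; the clip=True option (scikit-learn ≥ 0.24) pins out-of-range values like the hypothetical 7 to the edge of [0, 1] instead of letting them exceed it:

```python
# Sketch: fit min/max on training data, then clip unseen out-of-range
# values at transform time. Rows are invented samples within the ranges.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[3.0, 0.02, 10.0],
                    [5.0, 0.05, 15.0],
                    [4.0, 0.03, 12.0]])

scaler = MinMaxScaler(feature_range=(0, 1), clip=True).fit(X_train)
X_scaled = scaler.transform(X_train)   # every column now spans [0, 1]

# A later value above the training max (7 > 5) is clipped to 1:
X_new = scaler.transform(np.array([[7.0, 0.04, 11.0]]))
```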

Should Feature Selection be done before Train-Test Split or after?

被刻印的时光 ゝ submitted on 2019-12-01 14:24:56
Actually, there is a contradiction between two facts that are the possible answers to the question: The conventional answer is to do it after splitting, as there can be information leakage from the test set if it is done before. The contradicting answer is that, if only the training set chosen from the whole dataset is used for feature selection, then the feature selection or feature importance score ordering is likely to change with the random_state of the train_test_split. And if the feature selection for any particular work changes, then no generalization of feature importance can
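One standard way to reconcile the two facts is to put the selector inside a Pipeline and cross-validate the whole thing, so selection is refit on each training fold and the corresponding test fold never leaks into the feature scores. A sketch on synthetic data (SelectKBest with f_classif and k=5 are illustrative choices, not prescribed by the question):

```python
# Sketch: leakage-free feature selection by nesting it in the CV pipeline.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=5)),      # refit per training fold
    ("clf", LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipe, X, y, cv=5)        # selection never sees test folds
```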

feature selection using logistic regression

柔情痞子 submitted on 2019-12-01 13:20:40
I am performing feature selection (on a dataset with 1,930,388 rows and 88 features) using logistic regression. If I test the model on held-out data, the accuracy is just above 60%. The response variable is equally distributed. My question is: if the model's performance is not good, can I consider the features that it gives as actually important features? Or should I try to improve the accuracy of the model, though my end goal is not to improve the accuracy but only to get important features?

sklearn's GridSearchCV has some pretty neat methods to give you the best feature set. For example, consider
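Whatever the accuracy, coefficients are only comparable as importances if the features are on a common scale. A sketch (synthetic data standing in for the 88-feature set) that standardizes first and then ranks features by coefficient magnitude:

```python
# Sketch: standardize, fit logistic regression, rank features by |coef|.
# The dataset below is synthetic; shapes do not match the asker's data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=10,
                           n_informative=3, random_state=0)

pipe = make_pipeline(StandardScaler(),
                     LogisticRegression(max_iter=1000)).fit(X, y)
coefs = pipe[-1].coef_.ravel()            # one weight per (scaled) feature
ranking = np.argsort(-np.abs(coefs))      # feature indices, most important first
```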

Making the features of test data same as train data after featureselection in spark

流过昼夜 submitted on 2019-12-01 09:21:24
Question: I'm working in Scala. I have a big question: ChiSqSelector seems to reduce the dimension successfully, but I can't identify which features were removed and which were retained. How can I know which features were reduced?

[WrappedArray(a, b, c),(5,[1,2,3],[1,1,1]),(2,[0],[1])]
[WrappedArray(b, d, e),(5,[0,2,4],[1,1,2]),(2,[1],[2])]
[WrappedArray(a, c, d),(5,[0,1,3],[1,1,1]),(2,[0],[1])]

PS: when I wanted to make the test data the same as the feature-selected train data, I found that I don't know how to do that in
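In Spark ML, the fitted ChiSqSelectorModel should expose the retained column indices via its selectedFeatures field, and applying that same fitted model to the test DataFrame keeps train and test consistent (hedged: check the Spark version's API docs). The equivalent idea in scikit-learn terms, on toy data, since the same pattern answers both halves of the question:

```python
# Sketch: fit a chi-squared selector on train data, read out which column
# indices were kept, and reuse the SAME fitted selector on test data.
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

X_train = np.array([[1, 0, 1, 0, 0],
                    [1, 0, 1, 0, 2],
                    [1, 1, 0, 1, 0],
                    [0, 1, 0, 1, 1]])
y_train = np.array([0, 1, 0, 1])

sel = SelectKBest(chi2, k=2).fit(X_train, y_train)
kept = sel.get_support(indices=True)     # indices of the retained columns
X_test = np.array([[0, 1, 1, 1, 0]])
X_test_reduced = sel.transform(X_test)   # test data gets the same columns
```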

glmulti Oversized candidate set

时光毁灭记忆、已成空白 submitted on 2019-12-01 04:33:50
Error message: SYSTEM: win7/64bit/ultimate/16 GB real RAM plus virtual memory, memory.limit(32000). What does this error message mean? In glmulti(y = "y", data = mydf, xr = c("x1", : !Oversized candidate set. mydf has 3.6 million rows & 150 columns of floats. What steps can I take to work around it in glmulti? Are there any alternatives to glmulti in the R world? R/64bit "Good Sport"

I have encountered the same problem; here is what I have found out so far: The number of rows does not seem to be the issue. The issue is that with 150 predictors the package can't handle an exhaustive search (that is take a look and
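A quick count (plain Python arithmetic, just to make the answer's point concrete) shows why 150 predictors makes the candidate set "oversized" for an exhaustive search: the candidate set is every subset of predictors, and even capping models at 5 terms leaves hundreds of millions of candidates. This is presumably why glmulti also offers a genetic-algorithm search (method = "g") as the non-exhaustive alternative:

```python
# Sketch: size of the candidate model set for p predictors.
from math import comb

p = 150                       # number of predictors, as in the question
n_models = 2 ** p             # one candidate model per subset of predictors
n_small = sum(comb(p, k) for k in range(6))  # only models with <= 5 terms
```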