feature-selection

Feature selection in document-feature matrix by using chi-squared test

Submitted by 三世轮回 on 2021-02-06 12:49:32
Question: I am doing text mining using natural language processing. I used the quanteda package to generate a document-feature matrix (dfm). Now I want to do feature selection using a chi-squared test. I know this question has already been asked many times, but I couldn't find the relevant code for it. (The answers only give a brief concept, like this one: https://stats.stackexchange.com/questions/93101/how-can-i-perform-a-chi-square-test-to-do-feature-selection-in-r) I learned that I could use

LinearSVC Feature Selection returns different coef_ in Python

Submitted by 自闭症网瘾萝莉.ら on 2021-01-29 10:19:01
Question: I'm using SelectFromModel with a LinearSVC on a training data set. The training and testing sets have already been split and are saved in separate files. When I fit the LinearSVC on the training set I get a set of coef_[0] values, from which I try to find the most important features. When I rerun the script I get different coef_[0] values even though it runs on the same training data. Why is this the case? See below for a snippet of the code (maybe there's a bug I don't see): fig = plt.figure() #SelectFromModel lsvc =
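A common cause of run-to-run differences in `coef_` is that LinearSVC's solver involves randomness unless `random_state` is fixed. A minimal sketch on synthetic stand-in data (the real question uses files not shown here), showing that pinning `random_state` makes the coefficients reproducible:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC

# Synthetic stand-in for the training set (hypothetical data).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

def fit_lsvc():
    # Fixing random_state removes the solver's run-to-run variation,
    # so coef_ no longer changes between reruns on the same data.
    return LinearSVC(C=0.01, penalty="l1", dual=False,
                     random_state=0, max_iter=10000).fit(X, y)

a = fit_lsvc()
b = fit_lsvc()
print(np.allclose(a.coef_, b.coef_))  # same coefficients across reruns

model = SelectFromModel(a, prefit=True)
X_new = model.transform(X)
```

With the l1 penalty many coefficients shrink to exactly zero, which is what `SelectFromModel` exploits to drop features.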

How to manually select the features of the decision tree

Submitted by ╄→尐↘猪︶ㄣ on 2021-01-29 07:49:18
Question: I need to be able to change the features (in the machine learning sense) that are used to build the decision tree. Given the example of the Iris dataset, I want to be able to select Sepallength as the feature used in the root node and Petallength as the feature used in the nodes of the first level, and so on. To be clear, my aim is not to change the minimum sample split or the random state of the decision tree, but rather to select the features - the characteristics of the
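As far as I know, scikit-learn's DecisionTreeClassifier exposes no API for forcing a particular feature at a particular node, so a common workaround is to restrict training to a hand-picked column subset and then inspect which feature each node actually used. A minimal sketch on the Iris data:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# Restrict the tree to a hand-picked subset of columns; indices are
# into iris.data (0 = sepal length, 2 = petal length).
cols = [0, 2]
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data[:, cols], iris.target)

# tree_.feature records, per node, which column (index into `cols`)
# was chosen for the split; -2 marks leaves.
root_split = iris.feature_names[cols[clf.tree_.feature[0]]]
print(root_split)
```

Forcing a specific feature at each individual level would require building the tree manually (or wrapping per-node training), which this sketch does not attempt.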

Perform feature selection using pipeline and gridsearch

Submitted by 吃可爱长大的小学妹 on 2020-12-12 11:47:33
Question: As part of a research project, I want to select the combination of preprocessing techniques and textual features that optimizes the results of a text classification task. For this, I am using Python 3.6. There are a number of methods to combine features and algorithms, but I want to take full advantage of sklearn's pipelines and test all the different (valid) possibilities using grid search to find the ultimate feature combination. My first step was to build a pipeline that looks like the following
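The original pipeline is truncated above; a minimal sketch of the general pattern - vectorizer, feature selector, and classifier chained in a Pipeline, with GridSearchCV sweeping each step's hyperparameters via the `step__param` naming convention - on a made-up toy corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus; real research data would replace this.
docs = [
    "great movie loved it", "terrible plot and acting",
    "wonderful film highly recommended", "awful waste of time",
    "amazing story great cast", "boring dull and bad",
]
labels = [1, 0, 1, 0, 1, 0]

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("select", SelectKBest(chi2)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Hyperparameters of each step are addressed as "<step>__<param>".
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "select__k": [5, "all"],
}
grid = GridSearchCV(pipe, param_grid, cv=2).fit(docs, labels)
print(grid.best_params_)
```

Because the whole chain is refit inside each cross-validation fold, the feature selection is scored fairly rather than leaking information from the held-out fold.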

Determine most important feature per class

Submitted by 独自空忆成欢 on 2020-12-07 18:27:12
Question: Imagine a machine learning problem where you have 20 classes and about 7,000 sparse boolean features. I want to figure out the 20 most distinctive features per class - in other words, features that are used a lot in a specific class but hardly or never used in the other classes. What would be a good feature selection algorithm or heuristic for this? Answer 1: When you train a multi-class logistic regression classifier, the trained model is a num_class x num_feature matrix which is called
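The answer above refers to the per-class weight matrix of a multi-class logistic regression (`coef_` in scikit-learn). A minimal sketch on synthetic boolean data (smaller than the 20-class/7,000-feature setting in the question), ranking the top features per class by weight:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in: 3 classes, 50 sparse boolean features.
rng = np.random.default_rng(0)
X = (rng.random((300, 50)) < 0.1).astype(int)
y = rng.integers(0, 3, size=300)

clf = LogisticRegression(max_iter=2000).fit(X, y)

# coef_ has shape (n_classes, n_features); large positive weights mark
# features that push predictions toward that class and away from others.
top_k = 5
for c, weights in enumerate(clf.coef_):
    top = np.argsort(weights)[-top_k:][::-1]
    print(f"class {c}: top feature indices {top.tolist()}")
```

Adding an l1 penalty (`penalty="l1"` with a compatible solver) would additionally zero out the non-distinctive weights, which can sharpen the per-class ranking.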
