feature-selection

glmulti Oversized candidate set

Submitted by 我们两清 on 2019-12-01 03:16:54
Question: Error message. SYSTEM: Win7 / 64-bit / Ultimate / 16 GB physical RAM plus virtual memory, memory.limit(32000). What does this error message mean? In glmulti(y = "y", data = mydf, xr = c("x1", : !Oversized candidate set. mydf has 3.6 million rows and 150 columns of floats. What steps can I take to work around this in glmulti? Are there any alternatives to glmulti in the R world? R/64-bit "Good Sport"

Answer 1: I have encountered the same problem; here is what I have found out so far. The number of rows does not seem to be the issue. The

Fast Information Gain computation

Submitted by 时光总嘲笑我的痴心妄想 on 2019-11-30 13:22:31
Question: I need to compute Information Gain scores for >100k features in >10k documents for text classification. The code below works fine, but on the full dataset it is very slow, taking more than an hour on a laptop. The dataset is 20newsgroups and I am using scikit-learn; the chi2 function provided in scikit-learn runs extremely fast. Any idea how to compute Information Gain faster for such a dataset?

import numpy as np

def information_gain(x, y):
    def _entropy(values):
        counts = np.bincount(values)
        probs = counts[np.nonzero(counts)] / float(len(values))  # drop zero counts
        return -np.sum(probs * np.log(probs))
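
The excerpt above is cut off by the digest, so the following is not the poster's code but a minimal vectorized sketch of the same idea: binarize the document-term matrix to term presence/absence, obtain per-class contingency counts for every feature with a single sparse matrix product, and evaluate IG = H(Class) - P(t) H(Class | t present) - P(not t) H(Class | t absent) in log base 2. All function and variable names here are illustrative.

import numpy as np
from scipy.sparse import csc_matrix

def _entropy(probs):
    # Shannon entropy along the last axis; 0 * log(0) is treated as 0.
    probs = np.asarray(probs, dtype=float)
    logs = np.zeros_like(probs)
    nz = probs > 0
    logs[nz] = np.log2(probs[nz])
    return -(probs * logs).sum(axis=-1)

def information_gain_all(X, y):
    # IG(Class; term presence) for every column of a doc-term matrix X.
    X = csc_matrix(X, dtype=np.float64)
    X.data[:] = 1.0                                # binarize to present/absent
    classes, y_idx = np.unique(y, return_inverse=True)
    n_docs = X.shape[0]
    Y = np.zeros((n_docs, classes.size))           # one-hot class indicators
    Y[np.arange(n_docs), y_idx] = 1.0
    present = np.asarray(X.T @ Y)                  # docs with term, per class
    absent = Y.sum(axis=0) - present               # docs without term, per class
    p_t = present.sum(axis=1) / n_docs             # P(term present)

    def _cond_entropy(counts):
        tot = counts.sum(axis=1, keepdims=True)
        probs = np.divide(counts, tot, out=np.zeros_like(counts), where=tot > 0)
        return _entropy(probs)

    h_class = _entropy(Y.sum(axis=0) / n_docs)
    return h_class - (p_t * _cond_entropy(present)
                      + (1.0 - p_t) * _cond_entropy(absent))

Because the per-feature Python loop is replaced by one X.T @ Y product, this should be far faster on the >100k-feature scale the question describes.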

Information Gain calculation with Scikit-learn

Submitted by 五迷三道 on 2019-11-30 04:50:59
Question: I am using scikit-learn for text classification. I want to calculate the Information Gain for each attribute with respect to a class in a (sparse) document-term matrix. The Information Gain is defined as H(Class) - H(Class | Attribute), where H is the entropy. Using Weka, this can be accomplished with InfoGainAttribute. But I haven't found this measure in scikit-learn. However, it has been suggested that the formula above for Information Gain is the same measure as mutual information. This also matches the definition in Wikipedia. Is it possible to use a specific setting for mutual
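
On the premise stated above, that IG(Class; Attribute) equals the mutual information, one option is scikit-learn's mutual_info_classif with discrete_features=True, which computes exactly that discrete mutual information per column. A sketch under that assumption (the two-category slice is just to keep the example small; values come back in nats rather than bits, and get_feature_names_out needs scikit-learn >= 1.0, with older versions using get_feature_names):

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import mutual_info_classif

# Small two-category slice of 20newsgroups so the example runs quickly.
train = fetch_20newsgroups(subset="train", categories=["sci.space", "rec.autos"])
vec = CountVectorizer(binary=True)        # attributes as present/absent
X = vec.fit_transform(train.data)
ig = mutual_info_classif(X, train.target, discrete_features=True)

# Top 10 attributes by information gain (in nats).
terms = vec.get_feature_names_out()
for i in ig.argsort()[::-1][:10]:
    print(f"{terms[i]}: {ig[i]:.4f}")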

How is the feature score (importance) in the XGBoost package calculated?

Submitted by 懵懂的女人 on 2019-11-30 04:38:04
Question: The command xgb.importance returns a graph of feature importance measured by an f score. What does this f score represent and how is it calculated? Output: graph of feature importance.

Answer 1 (T. Scharf): This is a metric that simply sums up how many times each feature is split on. It is analogous to the Frequency metric in the R version. https://cran.r-project.org/web/packages/xgboost/xgboost.pdf It is about as basic a feature importance metric as you can get, i.e. how many times was this variable split on? The code for this method shows it is simply adding up the presence of a given feature in all the
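
For illustration, the same counting metric is exposed in the Python xgboost package as importance_type="weight" on Booster.get_score. A small self-contained sketch on made-up toy data (all names and parameters are illustrative):

import numpy as np
import xgboost as xgb

# Toy data purely for illustration.
rng = np.random.default_rng(0)
X = rng.random((200, 5))
y = (X[:, 0] + X[:, 2] > 1.0).astype(int)

model = xgb.XGBClassifier(n_estimators=20, max_depth=3).fit(X, y)

# "weight" counts how many times each feature appears as a split,
# i.e. the f score / Frequency-style metric described above.
# Features never split on are simply absent from the returned dict.
print(model.get_booster().get_score(importance_type="weight"))

Note that "weight" counts splits, not their quality, so the most frequently split-on features are not necessarily the most predictive.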

Recursive feature elimination on Random Forest using scikit-learn

Submitted by 不羁的心 on 2019-11-30 03:55:03
Question: I'm trying to perform recursive feature elimination using scikit-learn and a random forest classifier, with OOB ROC as the method of scoring each subset created during the recursive process. However, when I try to use the RFECV method, I get an error saying AttributeError: 'RandomForestClassifier' object has no attribute 'coef_'. Random forests don't have coefficients per se, but they do have rankings by Gini score. So, I'm wondering how to get around this problem. Please note that I want to use a method that will explicitly tell me what features from my pandas DataFrame were selected in the
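
For what it's worth, recent scikit-learn releases no longer raise this error: RFE/RFECV fall back to feature_importances_ when an estimator has no coef_. A sketch under that assumption, with cross-validated ROC AUC standing in for the OOB ROC the question asks about (the synthetic data and column names are illustrative):

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

# Synthetic stand-in for the poster's DataFrame.
X, y = make_classification(n_samples=500, n_features=12, n_informative=4,
                           random_state=0)
df = pd.DataFrame(X, columns=[f"feat_{i}" for i in range(X.shape[1])])

# Cross-validated ROC AUC replaces OOB ROC here.
rf = RandomForestClassifier(n_estimators=100, random_state=0)
selector = RFECV(rf, step=1, cv=5, scoring="roc_auc").fit(df, y)

print(list(df.columns[selector.support_]))   # names of the selected features

On older scikit-learn versions, the usual workaround was a thin subclass of RandomForestClassifier that exposed coef_ as an alias for feature_importances_.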

Random Forest Feature Importance Chart using Python

Submitted by 断了今生、忘了曾经 on 2019-11-29 20:41:59
Question: I am working with RandomForestRegressor in Python and I want to create a chart that will illustrate the ranking of feature importance. This is the code I used:

import pandas as pd
from sklearn.ensemble import RandomForestRegressor

MT = pd.read_csv("MT_reduced.csv")
df = MT.reset_index(drop=False)
columns2 = df.columns.tolist()

# Filter the columns to remove ones we don't want.
columns2 = [c for c in columns2 if c not in ["Violent_crime_rate", "Change_Property_crime_rate", "State", "Year"]]

# Store the variable we'll be predicting on.
target = "Property_crime_rate"

# Let's randomly split our data with 80% as the
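
The excerpt stops before the plotting step, so here is a generic sketch of the usual finish on synthetic data rather than the poster's CSV: fit the regressor, sort feature_importances_, and draw a horizontal bar chart (all names are illustrative).

import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
names = [f"x{i}" for i in range(X.shape[1])]
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

order = np.argsort(model.feature_importances_)   # least to most important
plt.barh(np.array(names)[order], model.feature_importances_[order])
plt.xlabel("Feature importance")
plt.title("RandomForestRegressor feature ranking")
plt.tight_layout()
plt.show()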

R caret package rfe never finishes: error "task 1 failed - replacement has length zero"

Submitted by ぃ、小莉子 on 2019-11-29 11:56:17
Question: I recently started to look into the caret package for a model I'm developing; I'm using the latest version. As the first step, I decided to use it for feature selection. The data I'm using has about 760 features and 10k observations. I created a simple function based on the online training material. Unfortunately, I consistently get an error, so the process never finishes. Here is the code that produces the error. In this example I am using a small subset of features; I started with the full set. I've also changed the subsets, the number of folds, and the repeats, to no avail. I know it will be
